Introduction

Welcome to A Primer on Quantile Regression for Epidemiologists! We are excited to have you here with us today! This document will serve as a companion to the slide deck which forms the basis of our discussion in today’s workshop.

Workshop background

Our workshop builds on a paper we wrote as an introduction to quantile regression methods for epidemiologists. The paper is entitled “Quantile regressions as a tool to evaluate how an exposure shifts and reshapes the outcome distribution: A primer for epidemiologists” and is available in the American Journal of Epidemiology at this link. The paper was supported by a grant from the National Institute on Aging (R01 AG069092, PI Vable).

This workshop would not be possible without Aayush Khadka, first author of our didactic paper, who is no longer with our team.

Overall learning aims

The learning aims of this workshop can be grouped into two categories: theory and practice. In terms of theory, our aim is to introduce you to quantile regression concepts by contrasting quantile regression with mean regression and, additionally, contrasting estimators targeted at quantiles of the conditional versus unconditional outcome distribution. In terms of practice, our goal is to provide you with code to conduct quantile regression analyses, to run various post-estimation tests, to present quantile regression coefficients, and to visualize how quantile regression results can be used to demonstrate the distributional effects of an exposure.

Practical example

The quantile regression code in this workshop will be presented in the context of an analysis investigating the relationship between educational attainment and later-life systolic blood pressure (SBP). This is a question that occupies a lot of our time in VabLab because education has been shown to have a strong, inverse relationship with average SBP; however, since there may be a non-linear relationship between SBP and the risk of coronary heart disease and stroke, it is imperative that we understand education’s potential role in reshaping the SBP distribution.

Our data for the practical example comes from the US Health and Retirement Study, which is a nationally representative (when using their weights), multi-cohort, biennially conducted longitudinal study of non-institutionalized US adults who are 50 years or older. Our analytic sample includes US born HRS participants who were first interviewed in 1998 or later, and had no missing covariate information (N = 8,875). Since blood pressure started being measured in the HRS in 2006, our study period ranged from 2006 to 2018.

We use self-reported total years of schooling as our exposure variable and first recorded SBP as our outcome variable. Total years of schooling takes on integer values between 5 and 17 years, with 5 indicating that the respondent had 5 or fewer years of schooling and 17 indicating that the respondent had 17 or more years of schooling. We do not use repeated measures of SBP in our data to keep the exposition of quantile regression ideas clear and uncomplicated. In our analysis, we will also include covariates such as age (linear + quadratic term), sex (female and male persons), race (Black, Latinx, White), whether the respondent was born in the US South, parental education as a marker of childhood socioeconomic status, and the year in which SBP was measured.

#install.packages("tidyverse")
library(tidyverse) #Data cleaning/manipulation

#Load dataset
data <- read_rds("QRWorkshop.rds")
summary(data)
##        id            age             age2          gender           race      
##  Min.   :   1   Min.   :51.00   Min.   :2601   Min.   :1.000   Min.   :1.000  
##  1st Qu.:2220   1st Qu.:54.00   1st Qu.:2916   1st Qu.:1.000   1st Qu.:1.000  
##  Median :4438   Median :57.00   Median :3249   Median :1.000   Median :1.000  
##  Mean   :4438   Mean   :59.79   Mean   :3645   Mean   :1.468   Mean   :1.335  
##  3rd Qu.:6656   3rd Qu.:62.00   3rd Qu.:3844   3rd Qu.:2.000   3rd Qu.:2.000  
##  Max.   :8875   Max.   :97.00   Max.   :9409   Max.   :2.000   Max.   :3.000  
##     southern         schlyrs          mom_ed          dad_ed     
##  Min.   :0.0000   Min.   : 5.00   Min.   : 5.00   Min.   : 5.00  
##  1st Qu.:0.0000   1st Qu.:12.00   1st Qu.:10.00   1st Qu.: 8.00  
##  Median :0.0000   Median :14.00   Median :12.00   Median :12.00  
##  Mean   :0.3781   Mean   :13.75   Mean   :11.33   Mean   :10.86  
##  3rd Qu.:1.0000   3rd Qu.:16.00   3rd Qu.:12.00   3rd Qu.:12.00  
##  Max.   :1.0000   Max.   :17.00   Max.   :17.00   Max.   :17.00  
##       year           sbp       
##  Min.   :2006   Min.   : 73.0  
##  1st Qu.:2008   1st Qu.:114.0  
##  Median :2010   Median :125.5  
##  Mean   :2011   Mean   :127.6  
##  3rd Qu.:2012   3rd Qu.:138.5  
##  Max.   :2018   Max.   :233.0
#Format categorical variables
data$gender <- factor(data$gender)
data$race <- factor(data$race)
data$year <- factor(data$year)

#Centering school years at 12 years of schooling
data$schlyrs_c <- data$schlyrs - 12

The specifics of the variables included in our dataset are provided below:

  • ID: Unique identifier
  • Age: Age of participant at the time of SBP measurement (linear/quadratic)
  • Gender: Gender of participant
     - 1 = Female
     - 2 = Male
  • Race: Race of participant
     - 1 = non-Hispanic White
     - 2 = non-Hispanic Black
     - 3 = Hispanic/Latinx
  • Southern: Birthplace of participant
     - 0 = Not born in the Southern US
     - 1 = Born in the Southern US
  • Schlyrs: Self-reported total years of education
     - 5-17 years
  • Schlyrs_c: Schlyrs centered at 12 years (to aid the interpretation of the intercept later)
     - -7 to 5 years
  • Mom_ed: Mother’s education
     - 5-17 years
  • Dad_ed: Father’s education
     - 5-17 years
  • Year: Year of SBP measurement
     - 2006-2018
  • SBP: Systolic blood pressure (mmHg)
     - 73-233

Teaching resources

All materials can be found on our repository. Please don’t hesitate to contact any of us with questions!

Abbreviations

When we first entered the public health/epidemiology world, we were shocked by how many abbreviations there were. And now, we’ve reached the stage of our careers where we too use a plethora of abbreviations in our daily lives. Unfortunately, this workshop has also fallen prey to our love for abbreviations. The following table lists all the abbreviations that you will see in this workshop along with their full forms. Please feel free to refer back to it whenever you want.

Abbreviation Full form
CDF Cumulative Distribution Function
CEF Conditional Expectation Function
CQF Conditional Quantile Function
CQR Conditional Quantile Regression
DGP Data Generating Process
IF Influence Function
OLS Ordinary Least Squares
RIF Recentered Influence Function
SBP Systolic Blood Pressure
UQR Unconditional Quantile Regression

A matter of life and distributions

In his seminal work “Sick Individuals and Sick Populations”, Geoffrey Rose wrote about the need to reduce the incidence of disease through population-level interventions that shift the distribution of risk factors. Rose contrasted this approach with the more standard approach to prevention, which focused on targeting those at high risk of disease and treating them, often with some form of medication. Additionally, Rose believed that quantifying the impact of population-level interventions on the distribution of risk factors could be done by quantifying how the intervention affects the mean of the risk factor distribution.

Rose’s ideas have influenced generations of epidemiologists, including us. Several epidemiologists have expanded his arguments to show that focusing simply on how the mean of the risk factor distribution changes may not give us full insight into how an intervention is affecting the entire distribution. In fact, these scholars have argued that we must explicitly test how a population-level intervention is affecting the tails of the risk factor distribution, especially as some of society’s most vulnerable people are located in the tails.

Blood pressure is a striking example to illustrate the need to focus on the tails of a distribution. Many studies have indicated that blood pressure may have a non-linear relationship with the risk of other poor health outcomes such as coronary heart disease or stroke. This means that a 20 mmHg decrease in blood pressure at high levels (say, from 180 mmHg to 160 mmHg) reduces the risk of poor health outcomes significantly more than a 20 mmHg decrease at low levels (say, from 140 mmHg to 120 mmHg). As such, a population-level intervention that is able to reduce blood pressure more at higher levels relative to lower levels may have a greater potential to improve population health than, say, an intervention which affects the blood pressure distribution uniformly.

Empirical epidemiology has, to date, largely focused on modeling how interventions affect the mean of an outcome. Unfortunately, focusing only on how an intervention affects the mean may not provide us with a good idea of how the same intervention is affecting the entire outcome distribution. Consider the following figure: we see that mean models are capable of capturing distributional effects only when an intervention uniformly affects the outcome distribution (“location shift”); however, when an intervention affects the variance of the distribution (“scale shift”) or both the location and scale of the outcome distribution, then mean models cannot capture distributional effects of an intervention.

Location and Scale Shifts

#install.packages("gridExtra")
library(gridExtra) #Printing ggplots together

set.seed(2468) #Remove seed to create different distribution plots

#Location shift
cont <- rnorm(100, mean = 0, sd = 1)
loc_t <- cont + 3

ls <- data.frame("Outcome" = c(cont, loc_t),
                 "Group" = factor(c(rep("Control", length(cont)),
                                    rep("Treatment", length(loc_t)))))

location_shift <- ggplot(aes(Outcome, color = Group, linetype = Group), 
                         data = ls) +
  geom_density() +
  xlim(-10, 10) +
  ylab("Density") +
  scale_color_manual(values = c("#F2494C", "#468C8A")) + 
  theme_classic() +
  theme(strip.background = element_blank(),
        legend.position = "bottom")


#Scale shift
scl_t <- rnorm(100, mean = 0, sd = 3)

ss <- data.frame("Outcome" = c(cont, scl_t),
                 "Group" = factor(c(rep("Control", length(cont)),
                                    rep("Treatment", length(scl_t)))))

scale_shift <- ggplot(aes(Outcome, color = Group, linetype = Group),
                      data = ss) +
  geom_density() +
  xlim(-10, 10) +
  ylab("Density") +
  scale_color_manual(values = c("#F2494C", "#468C8A")) + 
  theme_classic() +
  theme(strip.background = element_blank(),
        legend.position = "bottom")


#Location and Scale shift
ls_t <- scl_t + 3

lss <- data.frame("Outcome" = c(cont, ls_t),
                  "Group" = factor(c(rep("Control", length(cont)),
                                     rep("Treatment", length(ls_t)))))

loc_scale_shift <- ggplot(aes(Outcome, color = Group, linetype = Group), 
                          data = lss) +
  geom_density() +
  xlim(-10, 10) +
  ylab("Density") +
  scale_color_manual(values = c("#F2494C", "#468C8A")) + 
  theme_classic() +
  theme(strip.background = element_blank(),
        legend.position = "bottom")

#Final plot
grid.arrange(location_shift, scale_shift, loc_scale_shift, 
             ncol = 2, nrow = 2)

We argue that quantile regressions can offer us a way of understanding how an exposure affects the entire outcome distribution. In what follows we will demonstrate how quantile regressions allow us to investigate distributional effects. But first, we’ll set the stage for quantile regression by taking a brief foray into the world of means, quantiles, and mean models, in particular, linear regression.

A refresher on means and quantiles

Means

When we’re talking about means, we’re focusing on the arithmetic mean. We are probably all familiar with the idea of a mean or average of a random variable. Essentially, the mean represents the “center of mass” of a variable’s distribution.

We’re probably also all aware that we can estimate the mean of a random variable by taking a probability weighted sum of all elements of a random variable (e.g., in the case of a continuous variable, we would integrate the product of each element of the variable and the probability of its occurrence). Or, more simply, we could add up all the elements of a variable and divide the sum by the total number of elements (something that’s hard to do when we have a truly continuous variable with many elements).

However, there is also another approach to finding the mean of a random variable. This approach relies on solving a minimization problem. Specifically, we can find the mean of a random variable \(Y\) by minimizing:

\[ \min_a \sum_{i=1}^{N}(y_i - a)^2 \]

Here, \(a\) is a candidate value from the range of \(Y\). What this minimization problem says is that the value \(a\) which minimizes the sum of the squared differences between itself and all observed values of \(Y\) is the mean of \(Y\). In other words, the mean minimizes the sum of squared deviations.
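To make this concrete, here is a minimal base-R sketch (using simulated data, not our HRS sample) that hands the sum of squared deviations to `optimize()` and confirms that the minimizer is the sample mean:

```r
#Sketch with simulated data: the value minimizing the sum of squared
#deviations is the sample mean
set.seed(1)
y <- rnorm(500, mean = 120, sd = 15)

#Sum of squared deviations from a candidate value a
ssq <- function(a) sum((y - a)^2)

#Minimize over the observed range of y
opt <- optimize(ssq, interval = range(y))

opt$minimum #Numerically equal to...
mean(y)
```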

What’s interesting is that we can extend this optimization framework to finding means of the conditional distribution of \(Y\). For instance, let \(X\) be the conditioning variable. Then, we can find the mean of \(Y|X\) (\(Y\) conditional on \(X\)) by minimizing:

\[ \min_\beta \sum_{i=1}^{N} (y_i - g(x_i, \beta))^2 \] Here, \(g(x_i, \beta)\) is an arbitrary function of the conditioning variable. This minimization is something that you may have already seen before, albeit in a very specific context. For instance, let’s assume that \(g(x_i, \beta)\) can be written as a linear-in-parameters equation: \(\beta_0 + \beta_1 x_i\). In this case, we can write the minimization problem as:

\[ \min_{\beta_0, \beta_1} \sum_{i=1}^{N} (y_i - \beta_0 - \beta_1x_i)^2 \] Notice that this is the famous “minimizing the sum of squared errors” from least squares regression! In fact, the whole reason for showing you the optimization approach to finding means (and later, quantiles) is that this optimization serves as the basis for linear regression (and later, quantile regression)!
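As a quick sanity check, we can carry out this minimization numerically and compare it against `lm()`. This is a sketch with simulated data (not our HRS sample):

```r
#Sketch with simulated data: minimizing the sum of squared errors over
#(beta0, beta1) recovers the coefficients reported by lm()
set.seed(2)
x <- rnorm(500)
y <- 3 + 2 * x + rnorm(500)

#Sum of squared errors for candidate coefficients b = c(b0, b1)
sse <- function(b) sum((y - b[1] - b[2] * x)^2)
fit_opt <- optim(c(0, 0), sse)

coef(lm(y ~ x)) #Least squares fit
fit_opt$par     #Numerically the same estimates
```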

One final point about means: there’s a very nice link between the means of the conditional distribution and the mean of the unconditional distribution, captured by the Law of Iterated Expectations. The Law of Iterated Expectations sounds fancy, but the underlying idea is something we might already be familiar with. Basically, it says that the unconditional mean of a variable (\(E[Y]\)) can be found by taking a probability weighted sum of all the conditional means (\(E[Y|X]\)) (i.e., \(E[Y] = E_X[E[Y|X]]\)). Let’s illustrate this in the context of SBP in our data, where the conditioning variable is age group (<60 years, 60-69 years, 70-79 years, and 80+ years).

Unconditional and Conditional Means

#Unconditional mean
mean(data$sbp) #127.63
## [1] 127.6303
#Conditional means
data <- data %>%
  mutate(age_group = case_when(age < 60 ~ "<60",
                               between(age, 60, 69) ~ "60-69",
                               between(age, 70, 79) ~ "70-79",
                               age >= 80 ~ "80+"))
data %>%
  group_by(age_group) %>%
  summarise(mean_sbp = mean(sbp),
            n = n(),
            product = mean_sbp*n)
## # A tibble: 4 × 4
##   age_group mean_sbp     n product
##   <chr>        <dbl> <int>   <dbl>
## 1 60-69         130.  1925 251048.
## 2 70-79         134.   613  82178.
## 3 80+           135.   524  70994.
## 4 <60           125.  5813 728498.
#Law of Iterated Expectations in action: multiply each conditional mean from
#the table above by the probability of that age group occurring in the data
#(calculated inside the parentheses) and sum
125.3223*(5813/8875) + 130.4148*(1925/8875) + 134.0595*(613/8875) + 135.4838*(524/8875)
## [1] 127.6303

We will see the Law of Iterated Expectations show its full glory when we’re talking about unconditional quantile regressions later in this workshop. Additionally, it is the Law of Iterated Expectations that allows us to interpret results from a plain vanilla linear regression model as the effect of an exposure on the unconditional mean of the outcome.

Quantiles

When we say quantiles, think “percentiles”. We’ll continue to use the term quantiles in this workshop since we’re talking about quantile regressions, although we’ve sometimes heard quantile regressions called “percentile regressions” as well. We would advise you to stick with the term quantile regression, as it is more standard and avoids confusion.

For a given random variable \(Y\), the \(\tau^{th}\) quantile is that value of \(Y\) such that a proportion \(\tau\) of the values lie below it. For example, the \(50^{th}\) quantile of \(Y\) is the value of \(Y\) such that 50% of observations lie below it. Similarly, the \(25^{th}\) and \(75^{th}\) quantiles of \(Y\) are those values of \(Y\) such that 25% and 75% of observations lie below them. The \(50^{th}\) quantile is better known by the term “median”. Sometimes, the \(50^{th}\), \(25^{th}\), and \(75^{th}\) quantiles may also be written as the 0.50, 0.25, and 0.75 quantiles. Finally, please note that quantiles aren’t restricted to the \(50^{th}\), \(25^{th}\), or \(75^{th}\) quantiles. \(\tau\) can take any value between 0 and 1, so you can define any quantile in that range: the 0.10 quantile, 0.18 quantile, 0.36 quantile, 0.45 quantile, 0.90 quantile…the list is actually infinite!

Obtaining quantile values in R

It’s easy to find quantiles of a random variable in R. Let’s take a look at how we’d find quantiles of the unconditional SBP distribution:

as.numeric(quantile(data$sbp, probs = c(0.10, 0.25, 0.50, 0.75, 0.90)))
## [1] 104.5 114.0 125.5 138.5 153.5
q10 <- as.numeric(quantile(data$sbp, probs = 0.10)) 
q25 <- as.numeric(quantile(data$sbp, probs = 0.25))
q50 <- as.numeric(quantile(data$sbp, probs = 0.50))
q75 <- as.numeric(quantile(data$sbp, probs = 0.75))
q90 <- as.numeric(quantile(data$sbp, probs = 0.90)) 

Finding quantiles through optimization

Quantiles have an interesting parallel to means, in the sense that we can estimate quantiles by solving an optimization problem too. For a random variable \(Y\), we can find its \(\tau^{th}\) quantile by minimizing:

\[ \min_a \sum_{i=1}^{N} \rho_\tau(y_i - a). \]

This minimization looks similar to the one we used for finding means because both involve minimizing a sum, and in both cases \(a\) refers to a candidate value from the range of \(Y\). However, what’s inside the sum is different: in the case of means we minimized the sum of squared deviations from \(a\), while here the deviations of each value of \(Y\) from \(a\) are passed through the \(\rho_\tau\) function before being summed.

It’s worth interrogating the \(\rho_\tau\) function a bit more since it shows up again in conditional quantile regressions. The function \(\rho_\tau\) is defined as follows:

\[ \rho_\tau(u) = u(\tau - I(u<0)) \]

where \(\tau\) refers to the quantile of interest, \(u\) is the argument of the function, and \(I(u<0)\) is the indicator function which takes the value 1 when \(u<0\) and 0 otherwise. We use \(u\) to make the expression of \(\rho_\tau\) convenient; however, you can replace \(u\) with \(y_i - a\) to express \(\rho_\tau\) in a way that conforms to the minimization we saw above:

\[ \rho_\tau(y_i - a) = (y_i - a) (\tau - I(y_i - a < 0)). \] The \(\rho_\tau\) function is also known as the “check function” because of how it sketches out a check mark when we plot its points. In our presentation, we show the check mark plot in the case of our data with \(a = 100\) and \(a = 120\) and SBP being the random variable of interest.
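To see the check function at work, here is a small base-R sketch (simulated data, not our HRS sample): we code up \(\rho_\tau\), sum it over deviations from a candidate value \(a\), and minimize. The minimizer lands on the sample quantile:

```r
#The check function
rho <- function(u, tau) u * (tau - (u < 0))

#Simulated outcome data
set.seed(3)
y <- rnorm(1000, mean = 125, sd = 18)

#Summed check loss for a candidate value a
check_loss <- function(a, tau) sum(rho(y - a, tau))

#Minimize for the 0.75 quantile
opt <- optimize(check_loss, interval = range(y), tau = 0.75)

opt$minimum               #Essentially the same as...
quantile(y, probs = 0.75)
```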

Just as with means, we can find quantiles of the conditional distribution of a random variable by solving an optimization problem as well.

For instance, let \(X\) be the conditioning variable for a random variable \(Y\). Then, we can find the \(\tau^{th}\) quantile of \(Y|X\) (\(Y\) conditional on \(X\)) by minimizing:

\[ \min_\beta \sum_{i=1}^{N} \rho_\tau(y_i - g(x_i, \beta)). \] Here, \(g(x_i, \beta)\) is an arbitrary function of the conditioning variable. We saw how, in the case of means, the minimization used to find the conditional means serves as the foundation for linear regression. The same is true for conditional quantile regressions! For instance, let’s assume that \(g(x_i, \beta)\) can be written as a linear-in-parameters equation: \(\beta_0 + \beta_1 x_i\). In this case, we can write the minimization problem as:

\[ \min_{\beta_0, \beta_1} \sum_{i=1}^{N} \rho_\tau(y_i - \beta_0 - \beta_1x_i). \] This is exactly the optimization problem we need to solve to estimate the parameters of a conditional quantile regression! More on this later.
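To preview where this is headed, here is a base-R sketch (simulated data with hypothetical variable names) that estimates conditional quantile regression coefficients by minimizing the summed check loss directly with `optim()`. In practice you would use a dedicated routine such as `quantreg::rq()`, which solves the same problem far more efficiently:

```r
#The check function
rho <- function(u, tau) u * (tau - (u < 0))

#Simulated data where the error variance grows with x,
#so quantile-specific slopes differ
set.seed(4)
x <- runif(1000, 0, 10)
y <- 1 + 2 * x + rnorm(1000, sd = 1 + 0.3 * x)

#Minimize the summed check loss over (beta0, beta1) for a given tau
qr_fit <- function(tau) {
  loss <- function(b) sum(rho(y - b[1] - b[2] * x, tau))
  optim(c(0, 0), loss, control = list(maxit = 5000))$par
}

qr_fit(0.10) #Slope below 2: lower quantiles rise more slowly with x
qr_fit(0.90) #Slope above 2: upper quantiles rise faster with x
```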

Going from conditional to unconditional quantiles

The figure below shows the 10th, 25th, 50th, 75th, and 90th quantiles of the unconditional SBP distribution (panel a) and of the SBP distribution among 80+ year olds in our data (panel b). Notice how the quantile values of the unconditional and conditional distributions don’t line up particularly well. This is to be expected, of course, but it poses a challenge when it comes to working back up from conditional quantiles to unconditional quantiles.

#Unconditional density plot
unconditional_den <- ggplot(data, aes(x = sbp)) + 
  geom_density(aes(y = after_stat(density))) +
  xlab("Systolic Blood Pressure (mmHg)") +
  ylab("Density")  +
  labs(title = "a) Unconditional density of systolic blood pressure") + 
  theme_classic() +
  theme(strip.background = element_blank(),
        text = element_text(size = 10)) + 
  geom_vline(aes(xintercept = q10), color = "#F2494C", linewidth = 1) +
  geom_vline(aes(xintercept = q25), color = "#FF9300", linewidth = 1) +
  geom_vline(aes(xintercept = q50), color = "#468C8A", linewidth = 1) +
  geom_vline(aes(xintercept = q75), color = "blue", linewidth = 1) +
  geom_vline(aes(xintercept = q90), color = "purple", linewidth = 1) +
  scale_x_continuous(breaks = seq(60, 240, by = 20))


#Conditional density plot (for 80+ group)
data_8 <- data %>% 
  dplyr::filter(age_group == "80+")

q10_8 <- as.numeric(quantile(data_8$sbp, probs = 0.10)) 
q25_8 <- as.numeric(quantile(data_8$sbp, probs = 0.25))
q50_8 <- as.numeric(quantile(data_8$sbp, probs = 0.50))
q75_8 <- as.numeric(quantile(data_8$sbp, probs = 0.75))
q90_8 <- as.numeric(quantile(data_8$sbp, probs = 0.90)) 


conditional_den <- ggplot(data_8, aes(x = sbp)) + 
  geom_density(aes(y = after_stat(density))) +
  xlab("Systolic Blood Pressure (mmHg)") +
  ylab("Density")  +
  labs(title = "b) Conditional density of systolic blood pressure (80+)") + 
  theme_classic() +
  theme(strip.background = element_blank(),
        text = element_text(size = 10)) + 
  geom_vline(aes(xintercept = q10_8), color = "#F2494C", linewidth = 1) +
  geom_vline(aes(xintercept = q25_8), color = "#FF9300", linewidth = 1) +
  geom_vline(aes(xintercept = q50_8), color = "#468C8A", linewidth = 1) +
  geom_vline(aes(xintercept = q75_8), color = "blue", linewidth = 1) +
  geom_vline(aes(xintercept = q90_8), color = "purple", linewidth = 1) +
  scale_x_continuous(breaks = seq(60, 240, by = 20))


grid.arrange(unconditional_den, conditional_den, nrow = 2)

With means, we could rely on the Law of Iterated Expectations to help us back out the unconditional mean from the conditional means. Unlike means, there is no Law of Iterated Quantiles (yet), which means that we usually cannot use information about quantiles of the conditional distribution to learn about the same quantiles of the unconditional distribution. This in turn means that we need to be careful about deciding whether the conditional or the unconditional distribution is of substantive interest before conducting a quantile regression analysis.
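A quick numerical illustration (simulated data, not our HRS sample): taking a probability weighted sum of group-specific medians does not recover the overall median, even though the analogous calculation for means is exact:

```r
#Two simulated groups with different spreads
set.seed(5)
g1 <- rnorm(3000, mean = 110, sd = 5)   #Hypothetical low-variance group
g2 <- rnorm(7000, mean = 145, sd = 30)  #Hypothetical high-variance group
y  <- c(g1, g2)

#A probability weighted sum of conditional medians...
wtd_median <- (3000 * median(g1) + 7000 * median(g2)) / 10000
wtd_median
#...does not equal the unconditional median
median(y)

#The analogous calculation for means is exact (Law of Iterated Expectations)
(3000 * mean(g1) + 7000 * mean(g2)) / 10000
mean(y)
```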

Choosing between Unconditional and Conditional Quantiles

Choosing between conditional and unconditional quantiles really depends on what research question you are interested in investigating. Are you interested in making comparisons of the exposure-outcome relationship at different quantiles across groups based on individual characteristics? In such a case, you are likely interested in quantiles of the conditional outcome distribution, where the conditioning variables are those individual-level characteristics. On the other hand, if you are interested in understanding what would happen to the distribution of an outcome in the entire population under different levels of an exposure, then you may be interested in quantiles of the unconditional outcome distribution.

Some additional nuances: if you believe that your exposure of interest only leads to a location shift in the outcome distribution, then the conditional and unconditional quantiles coincide and you don’t need to choose between the two. In fact, in such a scenario, you really don’t need to do quantile regressions at all and mean models will be perfectly adequate in capturing distributional effects of an exposure. However, if you believe that the exposure: 1) affects the variance of the outcome distribution and 2) interacts with other covariates you include in your model, then the conditional and unconditional quantiles will not coincide and you will need to be careful about identifying which outcome distribution you are truly interested in.
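We can see the location-shift case numerically with a simulated sketch (not our HRS data): under a pure location shift, the treatment effect is identical at every quantile (and equals the mean effect), whereas under a pure scale shift the quantile-specific effects differ even though the means agree:

```r
#Quantile-specific treatment effects under location vs. scale shifts
set.seed(6)
taus    <- c(0.10, 0.25, 0.50, 0.75, 0.90)
control <- rnorm(1e5, mean = 0, sd = 1)

loc_t <- control + 3                  #Pure location shift
scl_t <- rnorm(1e5, mean = 0, sd = 3) #Pure scale shift

#Identical effect (3) at every quantile
quantile(loc_t, taus) - quantile(control, taus)

#Negative effects in the lower tail, positive in the upper tail
quantile(scl_t, taus) - quantile(control, taus)
```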

Linear Regression

Learning Aims

  1. Linear regression and the conditional expectation function
  2. Estimating coefficients and standard errors in linear regression
  3. Interpreting linear regression results

Let’s say our objective is to quantify the association of educational attainment with systolic blood pressure (SBP). A potential solution could be to fit an unadjusted linear regression model, i.e., to fit a straight line through the data.

Unadjusted Ordinary Least Squares Model

ggplot(aes(x = schlyrs, y = sbp), data = data) +
  geom_point() +
  scale_x_continuous(limits = c(5, 17),
                     breaks = seq(5, 17, by = 1)) +
  xlab("School Years") +
  ylab("Systolic Blood Pressure (mmHg)") + 
  geom_smooth(method = "lm", formula = y ~ x, color = "#FF9300") +
  theme(panel.background = element_rect(fill = 'white', colour = 'white'),
        panel.grid.major = element_blank(),
        panel.grid.minor = element_blank(),
        axis.text = element_text(size = 12),
        axis.title = element_text(size = 13.5, face = "bold"),
        plot.title = element_text(size = 16, face = "bold"))

ols_un <- lm(sbp ~ schlyrs_c, data = data)
summary(ols_un)$coefficients["schlyrs_c", ]
##      Estimate    Std. Error       t value      Pr(>|t|) 
## -1.091731e+00  8.734720e-02 -1.249875e+01  1.509392e-35

But what exactly is linear regression modeling? The linear regression we fit in our sample is a model of the population regression line, which itself is a model of the conditional expectation function (CEF). The CEF is a function which describes how the conditional expectation of one variable changes as we change values of the conditioning variable. The CEF is usually written as \(E[Y|X]\) or in the context of our example, \(E[SBP|schlyrs]\). When the CEF is linear, i.e., when the conditional means of a variable can be connected by a straight line, then the population regression line and the CEF coincide. When the CEF is non-linear, the population regression line provides the best linear approximation to the non-linear CEF, where “best” is defined in terms of the line that minimizes the expected mean squared error.

Conditional Expectation Function (CEF)

#install.packages("gganimate")
#install.packages("gifski")
#install.packages("magick")

#Loading some libraries
library(gganimate) #Animation plots
library(gifski) #Graphics
library(magick) #Graphics

#Estimating the conditional means of SBP by level of schooling
#Reducing data
df_linreg_basic <- data %>%
  select("schlyrs", "sbp", "gender") %>%
  group_by(schlyrs) %>%
  mutate(mean_sbp = mean(sbp, na.rm = TRUE)) %>%
  ungroup()

#Creating a dataset for plotting
df_linreg_basic_animate <- rbind(
  #Step 1: Raw data only
  df_linreg_basic %>% mutate(mean_sbp = NA, time = '1. Full data'),
  #Step 2: SBP means only
  df_linreg_basic %>% mutate(sbp = mean_sbp,
                             mean_sbp = NA, time = '2. Conditional means')
)

#GIF1: Animating the conditional expectation
linreg_scatter <- ggplot(data = df_linreg_basic_animate, aes(x = schlyrs, y = sbp)) +
  
  #Regular plot
  geom_point() +
  theme(panel.background = element_rect(fill = 'white', colour = 'white'),
        panel.grid.major = element_blank(),
        panel.grid.minor = element_blank(),
        axis.text = element_text(size = 12),
        axis.title = element_text(size = 13.5, face = "bold"),
        plot.title = element_text(size = 16, face = "bold")) +
  scale_x_continuous(limits = c(5, 17),
                     breaks = seq(5, 17, by = 1)) +
  xlab("Total years of schooling") +
  ylab("Systolic Blood Pressure (mmHg)") +
  
  #Animation
  transition_states(time, wrap = FALSE) +
  ease_aes('linear') +
  exit_fade() +
  enter_fade()

animate(linreg_scatter,
        width = 10, height = 6.67, unit = "in",
        res = 300,
        renderer = gifski_renderer(loop = FALSE),
        fps = 20)

The conditional expectation function (CEF) tells us how the conditional mean of the outcome changes as we change values of the conditioning variable

The population regression line is represented by the equation \(\beta_0 + \beta_1x_i\), or \(\beta_0 + \beta_1schlyrs_i\) in the context of our example. The linear regression we fit in our sample represents our attempt to estimate the parameters of this population regression line and is written as \(\hat{\beta_0} + \hat{\beta_1}x_i\), where the \(\hat{}\) (“hat”) represents an estimated quantity. In the context of our example, the sample regression line is represented by the equation \(\hat{\beta_0} + \hat{\beta_1}schlyrs_i\).

Population Regression Line approximates the Conditional Expectation Function

#Creating a dataset of conditional means
df_sbp_means <- data %>%
  group_by(schlyrs) %>%
  summarise(mean_sbp = mean(sbp, na.rm = TRUE))

#CEF plot with regression line
cef_reg <- ggplot(data = df_sbp_means, aes(x = schlyrs, y = mean_sbp)) +
  geom_point() +
  geom_line() +
  geom_smooth(data = df_linreg_basic, aes(x = schlyrs, y = sbp), method = "lm", formula = y ~ x, color = "#FF9300", se = FALSE) +
  theme(panel.background = element_rect(fill = 'white', colour = 'white'),
        panel.grid.major = element_blank(),
        panel.grid.minor = element_blank(),
        axis.text = element_text(size = 12),
        axis.title = element_text(size = 13.5, face = "bold"),
        plot.title = element_text(size = 16, face = "bold")) +
  scale_x_continuous(limits = c(5, 17),
                     breaks = seq(5, 17, by = 1)) +
  scale_y_continuous(limits = c(60, 240),
                     breaks = seq(60, 240, by = 40)) +
  xlab("Total years of schooling") +
  ylab("Systolic Blood Pressure (mmHg)") 

print(cef_reg)

Since it is a model of the conditional expectation function, linear regression provides us with an estimate for how the mean of the outcome changes as we change the exposure by one unit

In the context of our example, linear regression answers the question: By how much does mean systolic blood pressure change for each additional year of schooling?

Estimating and Interpreting Coefficients

Imagine that our model of the world (i.e., the data generating process) can be represented with the equation \(y_i = \beta_0 + \beta_1x_i + \epsilon_i\), where \(\epsilon\) represents the error term. This equation can be rearranged to solve for the error: \(\epsilon_i = y_i - \beta_0 - \beta_1x_i\). The population regression line is \(E[Y|X] = \beta_0 + \beta_1x_i\), where we assume that \(E[\epsilon|X] = E[\epsilon] = 0\) (a.k.a. the zero conditional mean assumption) and \(Var(\epsilon) = \sigma^2\) (a.k.a. the constant variance assumption, or homoskedasticity).

We can calculate the coefficients of the population regression line by minimizing the expectation of the squared error term, which we can think of as minimizing the sum of squared errors: \((\beta_0,\beta_1) = \arg\min_{\beta_0,\beta_1}\sum_{i}{\epsilon_i^2} = \arg\min_{\beta_0,\beta_1}\sum_{i}(y_i - \beta_0 - \beta_1x_i)^2\). As mentioned earlier, linear regression in our sample is trying to estimate the parameters of this population regression line. There are several ways to estimate the coefficients of our sample linear regression (e.g., ordinary least squares, method of moments estimators, or maximum likelihood estimation). We will focus on Ordinary Least Squares (OLS) because it’s easy, familiar, and will come in handy later when we’re talking about unconditional quantile regressions.

OLS mimics the process of finding the population regression coefficients by minimizing the sum of squared residuals in the sample: \((\hat{\beta}_0,\hat{\beta}_1) = \arg\min_{\beta_0,\beta_1}\sum_{i}{\hat{\epsilon}_i^2} = \arg\min_{\beta_0,\beta_1}\sum_{i}(y_i - \beta_0 - \beta_1x_i)^2\). This minimization has an analytic solution: \(\hat{\beta}_0 = \bar{y} - \hat{\beta}_1\bar{x}\) and \(\hat{\beta}_1 = \frac{\sum_{i=1}^{N}(x_i-\bar{x})(y_i-\bar{y})}{\sum_{i=1}^{N}(x_i-\bar{x})^2}\). \(\hat{\beta}_0\) is the estimated average value of \(Y\) when \(x_i = 0\) (i.e., the mean of the distribution \(Y|x_i = 0\)). \(\hat{\beta}_1\) is the estimated change in the conditional mean of \(Y\) for a one unit change in \(X\).
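As a quick sanity check, here is a minimal sketch on simulated data (not the workshop dataset) verifying that the analytic solution above reproduces the coefficients returned by `lm()`:

```r
#Simulated data, for illustration only
set.seed(123)
x <- rnorm(100)
y <- 2 + 3 * x + rnorm(100)

#Analytic OLS solution
b1 <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
b0 <- mean(y) - b1 * mean(x)

#Compare against lm(): same intercept and slope
all.equal(unname(coef(lm(y ~ x))), c(b0, b1))  # TRUE
```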

OLS coefficients can be written in other ways too: for example, in matrix notation, we can write the linear regression model as \(\widehat{E[Y|X]} = X\hat{\beta}\) and the solution as \(\hat{\beta} = \arg\min_{\beta}\sum_{i=1}^{N}(y_i - x_i'\beta)^2 = (X'X)^{-1}X'Y\). Here \(\hat{\beta}\) is the vector of coefficients, \(X\) is the design matrix which includes all independent variables in the model as well as a column of 1’s for the intercept, \(X'\) is the transpose of the matrix \(X\) (i.e., rows and columns switched), and \(Y\) is the vector of outcomes.
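To connect the matrix formula to code, here is a small sketch (again on simulated data, not the workshop dataset) computing \((X'X)^{-1}X'Y\) directly and comparing it to `lm()`:

```r
#Simulated data with two covariates, for illustration only
set.seed(123)
n <- 100
x1 <- rnorm(n)
x2 <- rnorm(n)
y <- 1 + 2 * x1 - 0.5 * x2 + rnorm(n)

#Design matrix with a column of 1's for the intercept
X <- cbind(1, x1, x2)

#Matrix solution: (X'X)^{-1} X'Y
beta_hat <- solve(t(X) %*% X) %*% t(X) %*% y

#Matches the coefficients from lm()
all.equal(as.numeric(beta_hat), unname(coef(lm(y ~ x1 + x2))))  # TRUE
```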

The matrix notation is particularly useful when we have more than one covariate in our regression model. What’s nice about linear regression is that, regardless of how many covariates you include, the interpretation of the \(\beta_1\) coefficient remains essentially the same. For example, suppose we were trying to estimate parameters of the model \(E[Y|X] = \beta_0 + \beta_1x_{1,i} + \beta_2x_{2,i}\). We could interpret our estimate for \(\beta_1\) (i.e., \(\hat{\beta}_1\)) as the estimated change in the mean value of \(Y\) for a one unit change in \(X_1\), holding \(X_2\) constant. We fit a multivariable linear regression model below and provide an interpretation of the “schlyrs_c” coefficient:

#Fit adjusted ols model
ols <- lm(sbp ~ schlyrs_c + age + age2 + gender + race + southern + mom_ed +
            dad_ed + year, data = data)

summary(ols)$coefficients["schlyrs_c", ]
##      Estimate    Std. Error       t value      Pr(>|t|) 
## -7.905091e-01  9.296671e-02 -8.503141e+00  2.145557e-17

We can interpret the coefficient on “schlyrs_c” as follows: a one year increase in years of schooling is associated with a 0.79 mmHg decrease in average SBP, holding all other covariates in the model constant.

Estimating standard errors

The estimated coefficients in the linear regression model are themselves random variables and have a distribution, which we call the “sampling distribution”. The coefficients are random variables in the sense that they have been fit on a sample, and fitting the sample regression (as opposed to a population regression) on different samples from the same population would produce (potentially) different values of the coefficients. Under some assumptions, the central limit theorem tells us that the sampling distribution is approximately normal with mean equal to the population coefficient value. The standard error, which is the standard deviation of the sampling distribution, tells us how far our estimate is likely to fall from that population value.
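A short simulation can make the sampling distribution concrete. The sketch below (simulated data, not the workshop dataset) refits the same regression on 2,000 fresh samples; the standard deviation of the resulting slope estimates is exactly the quantity the standard error tries to estimate:

```r
#Simulate the sampling distribution of an OLS slope
set.seed(123)
slopes <- replicate(2000, {
  x <- rnorm(200)
  y <- 1 + 0.5 * x + rnorm(200)
  unname(coef(lm(y ~ x))[2])
})

mean(slopes)  #Centered near the true slope, 0.5
sd(slopes)    #The standard deviation of the sampling distribution
hist(slopes)  #Approximately normal, per the central limit theorem
```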

When the error term is homoskedastic, the variance-covariance matrix of the coefficients is \(Var(\hat{\beta})=\sigma^2(X'X)^{-1}\), and the standard errors are the square roots of its diagonal. However, the homoskedasticity assumption is almost always violated in practice. One common violation occurs when the error variance is a function of the covariates in the model, a situation known as heteroskedasticity. In this situation, the errors are no longer assumed to be identically distributed (but are still assumed to be independent). Heteroskedasticity-robust standard errors for our sample linear regression coefficients can be estimated using \(Var(\hat{\beta})=(X'X)^{-1}X' \Omega X(X'X)^{-1}\), where \(\Omega\) is the error variance-covariance matrix

\[ \Omega = \left[\begin{array}{cccc} \sigma_1^2 & 0 & \cdots & 0 \\ 0 & \sigma_2^2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \sigma_n^2 \end{array}\right] = \left[\begin{array}{cccc} \epsilon_1^2 & 0 & \cdots & 0 \\ 0 & \epsilon_2^2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \epsilon_n^2 \end{array}\right] = \left[\begin{array}{cccc} (y_1-\hat{y}_1)^2 & 0 & \cdots & 0 \\ 0 & (y_2-\hat{y}_2)^2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & (y_n-\hat{y}_n)^2 \end{array}\right] \]
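The sandwich formula above can be computed by hand in a few lines of base R. The sketch below uses simulated heteroskedastic data (not the workshop dataset); the result matches what `vcovHC(fit, type = "HC0")` from the “sandwich” package returns:

```r
#Simulated data where the error variance depends on x (heteroskedasticity)
set.seed(123)
n <- 500
x <- rnorm(n)
y <- 1 + 2 * x + rnorm(n, sd = exp(x / 2))

fit <- lm(y ~ x)
X <- model.matrix(fit)   #Design matrix, including the intercept column
e <- residuals(fit)

#Sandwich: (X'X)^{-1} X' Omega X (X'X)^{-1}, with Omega = diag(e_i^2)
bread <- solve(t(X) %*% X)
meat <- crossprod(X * e)  #Equivalent to t(X) %*% diag(e^2) %*% X
vcov_hc0 <- bread %*% meat %*% bread

sqrt(diag(vcov_hc0))  #Heteroskedasticity-robust standard errors
```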

Robust Standard Errors

#install.packages("lmtest")
#install.packages("sandwich")

library(lmtest) #coeftest for robust standard errors
library(sandwich) #vcovHC for robust standard errors

#Calculate robust standard errors (robust t-test)
ols_robust <- coeftest(ols, vcov = vcovHC(ols, type = 'HC0'))
ols_robust["schlyrs_c", ]
##      Estimate    Std. Error       t value      Pr(>|t|) 
## -7.905091e-01  9.451984e-02 -8.363420e+00  7.016933e-17
#Compare to unadjusted standard errors
summary(ols)$coefficients["schlyrs_c", ]
##      Estimate    Std. Error       t value      Pr(>|t|) 
## -7.905091e-01  9.296671e-02 -8.503141e+00  2.145557e-17

Notice that our robust standard error estimate for the coefficient on “schlyrs_c” is slightly different from the “non-robust” standard error estimate we saw earlier.

Bootstrapped Confidence Intervals

#install.packages("boot")
#install.packages("simpleboot")

library(boot) #Bootstrapping
library(simpleboot) #Bootstrapping

ols_func <- function(data){
  
  set.seed(123)
  ols_mod <- lm(sbp ~ schlyrs_c + age + age2 + gender + race + southern +
                  mom_ed + dad_ed + year, data = data)
  
  #Bootstrap standard error
  ols_boot <- lm.boot(ols_mod, R = 500)
  boot_sum <- summary(ols_boot)
  
  ols_schlyrs_est <- as.numeric(boot_sum$orig.lm$coefficients["schlyrs_c"])
  ols_schlyrs_sd <- as.numeric(boot_sum$stdev.params["schlyrs_c"])
  
  ols_row <- data.frame("Quantile" = -0.1,
                        "Estimate" = ols_schlyrs_est,
                        "Lower" = ols_schlyrs_est - (1.96*ols_schlyrs_sd),
                        "Upper" = ols_schlyrs_est + (1.96*ols_schlyrs_sd),
                        "regtype" = "OLS")
  
  return(ols_row)
  
}

ols_results <- ols_func(data)
ols_results[,"Estimate"]
## [1] -0.7905091
#Compare to robust standard errors
#ols_robust["schlyrs_c", ]

#Compare to unadjusted standard errors
#summary(ols)$coefficients["schlyrs_c", ]

Key Takeaways

  1. Linear regression approximates the conditional expectation function (CEF)
     - Coefficients can be interpreted as the slope of the CEF, i.e., the change in the conditional mean of the outcome for a unit change in the exposure
  2. An analytic solution exists for estimating linear regression coefficients
  3. Violations of the assumption of independent and identically distributed errors can be accounted for when estimating standard errors

Conditional Quantile Regression (CQR)

Learning Aims

  1. Conditional quantile regressions (CQR) and the conditional quantile function
  2. Estimating coefficients and standard errors in CQR
  3. Interpreting CQR results

Just as the CEF describes how the mean of a variable changes as we change values of the conditioning variable(s), the Conditional Quantile Function (CQF) describes how quantiles of a variable change as we change values of the conditioning variable. For example, in our data, the CQF for SBP at the 50th quantile conditional on years of schooling describes how the median of SBP changes as we go from those who have 5 years of schooling to 6 years of schooling to 7 years of schooling and so forth. Similarly, the CQF for SBP at the 75th quantile conditional on years of schooling describes how the 75th quantile of SBP changes as we go from 5 to 6 all the way to 17 years of schooling.
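To make the CQF concrete, we can compute empirical conditional quantiles by group, just as we computed conditional means for the CEF earlier. The sketch below uses simulated stand-ins for `sbp` and `schlyrs` rather than the workshop dataset:

```r
library(dplyr)

#Simulated stand-ins for sbp and schlyrs, for illustration only
set.seed(123)
sim <- data.frame(schlyrs = sample(5:17, 2000, replace = TRUE))
sim$sbp <- 140 - 0.8 * sim$schlyrs + rnorm(2000, sd = 15)

#Empirical conditional quantiles of SBP by years of schooling
df_sbp_quants <- sim %>%
  group_by(schlyrs) %>%
  summarise(q25 = quantile(sbp, 0.25),
            q50 = quantile(sbp, 0.50),
            q75 = quantile(sbp, 0.75))

head(df_sbp_quants)  #One row per schooling level, with three conditional quantiles
```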

We read earlier that linear regression models the CEF. In the same way, conditional quantile regression (CQR) models the CQF at the quantile of interest. So, the population CQR is a linear model for the potentially non-linear CQF just as the population linear regression is a linear model for the potentially non-linear CEF. It follows then that the CQR we fit in our sample is an approximation of the population CQR at the quantile of interest, just as the linear regression we fit in our sample is an approximation of the population linear regression.

Just as linear regression models the conditional expectation function, conditional quantile regression models the conditional quantile function

CQR allows us to understand how an exposure affects the outcome distribution by comparing estimates of the exposure-outcome relationship across multiple quantiles and with respect to the conditional mean. For example, consider a scenario in which the CEF is flat (i.e., slope = 0) and the CQF at the 75th quantile has a negative slope. This suggests that as we move to higher values of the conditioning variable, the right tail of the conditional outcome distribution gets pulled closer and closer to the mean, which suggests a reshaping of the outcome distribution.

How do we go about estimating CQR coefficients? Earlier, we saw that estimating quantiles of the conditional distribution of a random variable required minimizing the following:

\[ min_\beta \sum_{i=1}^{N} \rho_\tau(y_i - g(x_i, \beta)). \]

The most basic CQR model replaces \(g(x_i, \beta)\) with a linear in parameters model such as \(\beta_0 + \beta_1x_i\). This leads to the minimization:

\[ min_{\beta_0, \beta_1} \sum_{i=1}^{N} \rho_\tau(y_i - \beta_0 - \beta_1x_i). \]

More elaborate CQR models may include non-linear functions of the parameters, but we will not focus on those in our workshop. Rather, let’s talk about how we can estimate the coefficients of our linear-in-parameters CQR model. We learned in the linear regression review that the OLS estimator for linear regression has an analytic solution. This made our life wonderfully easy. However, we don’t have such a luxury in the CQR world: in the minimization above, there is no closed form solution for \(\beta_0\) and \(\beta_1\). Likewise, there is no closed form solution for a linear-in-parameters CQR model with more than one independent variable.

As many have asked: what is to be done? Is all hope lost? Not at all! Roger Koenker and Gilbert Bassett Jr., in their 1978 paper introducing CQR, showed us that we can estimate coefficients of the CQR model described above by using linear programming methods to solve the minimization problem. Linear programming methods are numerical ways of solving optimization problems subject to certain constraints (i.e., they are algorithms that “guess” the solution to an equation and keep modifying their guess to get to the best guess of the solution). We will not go into the details of how we can re-express the minimization problem described above as a linear programming problem – it is rather hairy and honestly not particularly instructive. We will say that there are numerous linear programming optimization methods out there to solve the CQR minimization problem. In most cases, however, the default linear programming choice provided in statistical packages for CQR should be good enough for fitting the model we want to fit.
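To build intuition for what the optimizer is doing, here is a small sketch (on simulated data) of the check function \(\rho_\tau(u) = u(\tau - 1[u < 0])\) for an intercept-only model. Numerically minimizing the sum of check losses recovers the sample quantile:

```r
#The check (tilted absolute value) function
rho_tau <- function(u, tau) u * (tau - (u < 0))

set.seed(123)
y <- rnorm(1001)

#For an intercept-only model, minimizing sum(rho_tau(y - b, tau)) over b
#yields the tau-th sample quantile of y
obj <- function(b, tau) sum(rho_tau(y - b, tau))
b_hat <- optimize(obj, interval = range(y), tau = 0.75)$minimum

b_hat                      #Numerical minimizer
unname(quantile(y, 0.75))  #Sample quantile; agrees up to optimizer tolerance
```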

Fitting CQR models in R

In R, we fit CQR models by using the function “rq” from the package quantreg to run our conditional quantile regressions (4-6). The structure of the command mirrors the structure of the command to fit linear regression using “lm()”. Specifically, we can fit CQR models in R as:

rq(formula, tau, data, method, …)

In the syntax above, formula takes on the structure “y ~ x”, tau refers to the quantile of interest, data refers to the dataset being used, and method refers to the optimization method you want to use to solve the linear programming problem to estimate model parameters. The default method is known as the Barrodale and Roberts algorithm (method = “br”). As the authors of the “quantreg” package suggest, the default optimization method should be perfectly fine even if you have a relatively large dataset with several thousand observations.

Let’s begin by fitting a model of SBP on years of schooling at the median without any additional control variables and think about how we can interpret the estimate.

#install.packages("quantreg")
library(quantreg) #Conditional quantile regression

#50th quantile
un_cqr_50 <- rq(sbp ~ schlyrs_c, data = data, tau = 0.5) 
summary(un_cqr_50)
## 
## Call: rq(formula = sbp ~ schlyrs_c, tau = 0.5, data = data)
## 
## tau: [1] 0.5
## 
## Coefficients:
##             Value     Std. Error t value   Pr(>|t|) 
## (Intercept) 127.00000   0.28031  453.07029   0.00000
## schlyrs_c    -0.90000   0.09484   -9.48945   0.00000

The CQR at the median produces an output that looks very similar to our linear regression output. Notice that we have an intercept term and a coefficient on the “schlyrs_c” variable. The intercept is our estimate for median SBP among those who have 12 years of schooling (recall that we have centered school years at 12). The coefficient on “schlyrs_c” can be interpreted as the association of an additional year of schooling with median SBP. So, our results suggest that median SBP among those with 12 years of schooling is 127 mmHg and that each additional year of schooling from 12 years is associated with a 0.9 mmHg reduction in median SBP.

Let’s now fit a CQR at the median adjusting for covariates:

cqr_50 <- rq(sbp ~ schlyrs_c + age + age2 + gender + race + southern +
               mom_ed + dad_ed + year, data = data, tau = 0.5) 
summary(cqr_50)
## 
## Call: rq(formula = sbp ~ schlyrs_c + age + age2 + gender + race + southern + 
##     mom_ed + dad_ed + year, tau = 0.5, data = data)
## 
## tau: [1] 0.5
## 
## Coefficients:
##             Value    Std. Error t value  Pr(>|t|)
## (Intercept) 61.93863 13.66039    4.53418  0.00001
## schlyrs_c   -0.71897  0.10217   -7.03679  0.00000
## age          1.50670  0.42296    3.56223  0.00037
## age2        -0.00834  0.00323   -2.57776  0.00996
## gender2      7.82589  0.42336   18.48530  0.00000
## race2        5.72343  0.62719    9.12558  0.00000
## race3        2.23193  0.82229    2.71429  0.00665
## southern     1.47659  0.47213    3.12750  0.00177
## mom_ed       0.00139  0.09712    0.01431  0.98858
## dad_ed      -0.05735  0.08075   -0.71019  0.47761
## year2008     0.42196  0.65757    0.64169  0.52109
## year2010     1.22034  0.70360    1.73443  0.08288
## year2012    -0.24101  0.66995   -0.35974  0.71905
## year2014    -2.72761  0.94059   -2.89991  0.00374
## year2016     0.57579  0.82739    0.69591  0.48650
## year2018    -1.19128  0.95305   -1.24996  0.21135

Notice that our adjusted model also produces an output very similar to the adjusted linear regression output. The interpretation of these estimates is also very similar. For instance, the intercept refers to the median SBP among folks with 12 years of schooling, holding all other covariates at 0. The coefficient on “schlyrs_c” is our estimate for the change in median SBP for each additional year of schooling, holding all other covariates constant. Based on our results, a one year increase in total years of schooling from 12 years is associated with a 0.72 mmHg decrease in median SBP, holding all else constant.

Let’s fit another adjusted model at the 25th quantile just so we get a bit of practice interpreting.

cqr_25 <- rq(sbp ~ schlyrs_c + age + age2 + gender + race + southern +
               mom_ed + dad_ed + year, data = data, tau = 0.25) 
summary(cqr_25)
## 
## Call: rq(formula = sbp ~ schlyrs_c + age + age2 + gender + race + southern + 
##     mom_ed + dad_ed + year, tau = 0.25, data = data)
## 
## tau: [1] 0.25
## 
## Coefficients:
##             Value    Std. Error t value  Pr(>|t|)
## (Intercept) 45.83193 12.64847    3.62352  0.00029
## schlyrs_c   -0.63744  0.09497   -6.71210  0.00000
## age          1.75066  0.38690    4.52488  0.00001
## age2        -0.01093  0.00292   -3.74937  0.00018
## gender2      8.75949  0.40369   21.69874  0.00000
## race2        3.03862  0.67453    4.50482  0.00001
## race3        1.32838  0.75699    1.75483  0.07932
## southern     1.42459  0.44549    3.19784  0.00139
## mom_ed      -0.04106  0.09420   -0.43592  0.66291
## dad_ed      -0.00892  0.07782   -0.11459  0.90877
## year2008     1.33398  0.60955    2.18847  0.02866
## year2010     0.82936  0.78946    1.05054  0.29350
## year2012     0.32310  0.64346    0.50213  0.61559
## year2014    -1.73780  1.19992   -1.44826  0.14758
## year2016     1.99214  0.73642    2.70516  0.00684
## year2018    -1.21576  0.72686   -1.67261  0.09444

The intercept term in this model is our estimate for the 25th quantile of SBP for folks with 12 years of schooling, holding all other covariates at 0. The coefficient on “schlyrs_c” is our estimate for the change in the 25th quantile of SBP for each additional year of schooling, holding all other covariates constant. Based on our results, a one year increase in total years of schooling from 12 years is associated with a 0.64 mmHg decrease in the 25th SBP quantile, holding all else constant.

Estimating standard errors

But coefficients are only part of the story. In any regression analysis, we care deeply about the standard errors too! The default standard error estimation method is the so-called “rank inversion” method described by Roger Koenker (2005). This default option assumes that the error term is independent and identically distributed, which is usually not a plausible assumption.

The recommended approach is to bootstrap the standard errors. We can use built-in options within the “quantreg” package to estimate bootstrap standard errors. The example below shows one way of doing this, using the default bootstrap method, which resamples observations from the data with replacement. Here we have specified 50 resamples.

# cqr_50 <- rq(sbp ~ schlyrs_c + age + age2 + gender + race + southern +
#                mom_ed + dad_ed + year, data = data, tau = 0.5)
summary.rq(cqr_50, se = "boot", R = 50)
## 
## Call: rq(formula = sbp ~ schlyrs_c + age + age2 + gender + race + southern + 
##     mom_ed + dad_ed + year, tau = 0.5, data = data)
## 
## tau: [1] 0.5
## 
## Coefficients:
##             Value    Std. Error t value  Pr(>|t|)
## (Intercept) 61.93863 14.16978    4.37118  0.00001
## schlyrs_c   -0.71897  0.08214   -8.75330  0.00000
## age          1.50670  0.44697    3.37092  0.00075
## age2        -0.00834  0.00340   -2.44977  0.01431
## gender2      7.82589  0.43806   17.86499  0.00000
## race2        5.72343  0.58328    9.81257  0.00000
## race3        2.23193  0.77983    2.86206  0.00422
## southern     1.47659  0.44477    3.31989  0.00090
## mom_ed       0.00139  0.11517    0.01207  0.99037
## dad_ed      -0.05735  0.09028   -0.63525  0.52528
## year2008     0.42196  0.60366    0.69901  0.48457
## year2010     1.22034  0.76627    1.59258  0.11129
## year2012    -0.24101  0.61130   -0.39425  0.69340
## year2014    -2.72761  1.07480   -2.53779  0.01117
## year2016     0.57579  0.86830    0.66313  0.50727
## year2018    -1.19128  0.94107   -1.26587  0.20559

Other bootstrap methods are available within the “quantreg” package. These include the Parzen, Wei, and Ying (1994) method, He and Hu’s (2002) Markov chain bootstrap method, and the Kocherginsky et al. (2003) bootstrap method. There is also a wild bootstrap method due to Feng et al. (2011). Which method is preferable likely depends on the intricacies of the analysis you are conducting. In general, the default method seems fine for simple, cross-sectional data structures.

The standard non-parametric bootstrap estimator used for CQR approximates the “heteroskedasticity” robust standard error estimator. We put heteroskedasticity in quotes because this terminology is less common in the quantile regression world. Rather, the relevant assumption is that the density of the error term at the \(\tau^{th}\) quantile of the conditional error distribution does not equal the density of the error term at the \(\tau^{th}\) quantile of the unconditional error distribution: this is similar in flavor to heteroskedasticity, but there are subtle differences. In any case, estimating robust standard errors in the CQR framework relies on estimating the inverse of the error density at the \(\tau^{th}\) quantile. Since the error density will be sparse where the outcome is sparse, and since the outcome is usually sparse at the tails, a key criticism of CQR is that statistical power is usually a problem at the tails of the distribution, which are often of greatest substantive interest.

A key criticism of conditional quantile regression is that statistical power is usually a problem at the tails of the outcome distribution, which are often of greatest substantive interest

Presenting CQR coefficients

Usually, we want to fit CQR models at multiple quantiles of the conditional outcome distribution to assess how an exposure affects the outcome distribution. But, fitting so many regressions poses a challenge in terms of displaying the results. Imagine that we fit a CQR model for each quantile from the 1st to 99th quantiles. Do we really want to have a table where we show the coefficient estimates at each quantile? That would look a bit ugly and might turn off our readers.

One way that we have seen CQR results presented in the literature is the following. We fit the CQR model at each quantile from, say, the 1st to the 99th. We estimate bootstrap standard errors or, at the very least, robust standard errors. We then compute the 95% confidence interval around the point estimate for our exposure. Now, instead of showing each coefficient and confidence interval as a separate row in a table, we plot them using the built-in plot function:

#Fitting CQR models from the 1st to 99th quantiles
cqr_all <- rq(sbp ~ schlyrs_c + age + age2 + gender + race + southern +
                mom_ed + dad_ed + year, data = data, 
              tau = seq(0.01, 0.99, by = 0.01)) #Model all quantiles
cqr_plot <- summary(cqr_all, se = "boot", R = 50, confint = TRUE)

#Plotting coefficients
plot(cqr_plot)

This is, however, not our favorite way to plot CQR results. We think that rather than plotting all the coefficients, it is perhaps most effective to plot coefficients for the exposure variable only. Additionally, the built-in “plot” function is not as flexible as “ggplot2”, which we love. So, in order to focus the presentation of results on the exposure variable and to use ggplot while we’re at it, we propose the following method of plotting CQR results:

cqr_func <- function(data){
  
  result <- data.frame()
  
  for(i in seq(0.01, 0.99, by = 0.01)){
    
    i <- round(i, 2)
    
    set.seed(123)
    cqr_mod <- rq(sbp ~ schlyrs_c + age + age2 + gender + race + southern +
                    mom_ed + dad_ed + year, data = data, tau = i)
    
    
    #Bootstrap standard error
    cqr_boot <- summary.rq(cqr_mod, se = "boot", R = 50)
    cqr_schlyrs_est <- as.numeric(cqr_boot$coefficients["schlyrs_c", "Value"])
    cqr_schlyrs_sd <- as.numeric(cqr_boot$coefficients["schlyrs_c", 
                                                       "Std. Error"])
    
    cqr_row <- data.frame("Quantile" = i,
                          "Estimate" = cqr_schlyrs_est,
                          "Lower" = cqr_schlyrs_est - (1.96*cqr_schlyrs_sd),
                          "Upper" = cqr_schlyrs_est + (1.96*cqr_schlyrs_sd),
                          "regtype" = "CQR")
    
    result <- rbind(result, cqr_row)
    
  }
  
  return(result)
  
}

cqr_results <- cqr_func(data)
##Warning: Solution may be nonunique (Produced because of categorical variables in the model)


#Plotting results
#Axis labels
yaxis <- expression(Delta~SBP~(mmHg))
xaxis <- ""

#Figure
ggplot(data = cqr_results, aes(x = Quantile, y = Estimate)) +
  geom_line(color = "#468c8a", linewidth = 1) +
  geom_ribbon(aes(x = Quantile, ymin = Lower, ymax = Upper), 
              alpha = 0.45, fill = "#468c8a") +
  geom_hline(yintercept = 0, alpha = 0.5) +
  theme(panel.background = element_rect(fill = 'white', colour = 'white'),
        panel.grid.major = element_blank(),
        panel.grid.minor = element_blank()) +
  labs(x = xaxis,
       y = yaxis,
       title = "Association of educational attainment with systolic blood pressure") +
  scale_x_continuous(breaks = c(0.01, seq(0.10, 0.90, 0.10), 0.99),
                     labels = c("q1", "q10", "q20", "q30", "q40", "q50", 
                                "q60", "q70", "q80", "q90", "q99"))

This figure is just a slightly fancier version of the plot produced by the built-in plot functionality in the “quantreg” package. We think it looks nicer and could be a good way of displaying CQR results. Notice that educational attainment appears to have a stronger protective association with SBP at higher quantiles of SBP relative to lower quantiles. This suggests that as we move to higher levels of schooling, the SBP distribution becomes narrower, and that the narrowing is driven by the higher-risk right tail of the SBP distribution being pulled closer to the center. Notice also how wide the confidence intervals around the point estimates are at the higher quantiles: this illustrates the point we made earlier that CQR estimates are usually less precise at the tails because data at the tails are usually sparse.

We can also add our OLS estimate to the CQR plot so viewers can easily compare estimates at the mean to estimates at each quantile of the conditional distribution.

ols_cqr <- rbind(ols_results, cqr_results)

#Figure
ggplot(data = ols_cqr %>% filter(regtype != "OLS"),
       aes(x = Quantile, y = Estimate, group = regtype,
           color = regtype, fill = regtype)) +
  geom_line(linewidth = 0.75) +
  geom_ribbon(aes(ymin = Lower, ymax = Upper), alpha = 0.27, color = NA) +
  geom_point(data = ols_cqr %>% filter(regtype == "OLS")) +
  geom_errorbar(data = ols_cqr %>% filter(regtype == "OLS"),
                aes(ymin = Lower, ymax = Upper), width = 0.01) +
  geom_hline(yintercept = 0, alpha = 0.5) +
  geom_vline(xintercept = -0.04, color = "gray", linetype = "dashed") +
  theme(panel.background = element_rect(fill = 'white', colour = 'white'),
        panel.grid.major = element_blank(),
        panel.grid.minor = element_blank(),
        legend.position = "bottom",
        text = element_text(size = 10))  +
  scale_color_manual(breaks = c("OLS", "CQR"),
                     labels = c("OLS", "CQR"),
                     values = c("#F2494C", "#468c8a")) +
  scale_fill_manual(breaks = c("OLS", "CQR"),
                    labels = c("OLS", "CQR"),
                    values = c("#F2494C", "#468c8a"))  +
  scale_x_continuous(limits = c(-0.11, 1),
                     breaks = c(-0.1, 0.01, seq(0.10, 0.90, 0.10), 0.99),
                     labels = c("OLS", "q1", "q10", "q20", "q30", "q40", "q50", 
                                "q60", "q70", "q80", "q90", "q99")) +
  labs(x = "",
       y = yaxis,
       color = "Model", group = "Model", fill = "Model") +
  theme(legend.position = "bottom",
        legend.box.spacing = unit(0, "pt"),
        legend.margin = margin(-10,0,0,0),
        text = element_text(size = 15))

Illustrating distributional effects

While the figure above allows us to imagine what sort of effect educational attainment is having on SBP, it doesn’t show us explicitly what’s happening. Wouldn’t it be nice to see, for instance, a density plot at a reference level of educational attainment and then another plot at, say, a higher level of educational attainment, and compare those two density plots? Now we’re going to create something like that. Please note that this plot has numerous assumptions built into it, which we will discuss in greater detail below. We also want to acknowledge that this plot was inspired by what we saw in Bind et al. (2015) and Bind et al. (2017), and we thank Dr. Bind for the generous insights she provided into how she created the plot.

We are going to create this plot as follows: first, we restrict our data to those who have 12 years of schooling. Then, we determine which quantile (from the 1st to the 99th) of the SBP distribution each individual with 12 years of schooling falls into. Next, we merge in the estimated association of education with SBP at each quantile. Finally, we predict the new value of SBP for each individual with 12 years of schooling had they completed one additional year of schooling (recall that the quantile-specific coefficients are for a one year increase in schooling).

#install.packages("statar")
library(statar) #Binning data

#Subsetting those with 12 years of schooling (Note that we have centered our educational attainment variable at 0, which is why we are setting schlyrs_c == 0 here)
data_12years <- data %>% 
  dplyr::filter(schlyrs_c == 0) %>%
  dplyr::select("sbp") #Keeping only the SBP variable to keep things neat and clean

#Determining the 1st-99th quantile of SBP for those with 12 years of schooling
data_12years$quant <- xtile(data_12years$sbp, n = 99)

#Creating a dataset of only our coefficient estimates
coef_estimates <- cqr_results %>% 
  dplyr::select("Quantile", "Estimate") %>%
  rename("quant" = "Quantile",
         "coef" = "Estimate") %>%
  dplyr::mutate(quant = round(quant, 2) * 100) #Change from the 0.01-0.99 scale to 1-99

#Merging coefficient estimates with the SBP data for those with 12 years of schooling
data_12years$quant <- round(data_12years$quant, 0)
coef_estimates$quant <- round(coef_estimates$quant, 0)
data_12years <- merge(data_12years, coef_estimates, by = "quant")

#Creating a counterfactual SBP value
data_12years$counterfactual_sbp <- data_12years$sbp + data_12years$coef

#Long version of data
data_12years_long <- data_12years %>%
  dplyr::select(-coef) %>%
  pivot_longer(!quant, names_to = "model", values_to = "sbp") %>%
  dplyr::mutate(model = case_when(model == "sbp" ~ "Factual",
                                  model == "counterfactual_sbp" ~ "Counterfactual"))

data_12years_long$model <- factor(data_12years_long$model, 
                                  levels = c("Factual", "Counterfactual"))

ggplot(data = data_12years_long, aes(x = sbp, color = model, 
                                     linetype = model)) +
  geom_density(aes(y = after_stat(density)), linewidth = 1.5) +
  theme(panel.background = element_rect(fill = 'white', colour = 'white'),
        panel.grid.major = element_blank(),
        panel.grid.minor = element_blank(),
        legend.position = "bottom",
        strip.background = element_blank(),
        text = element_text(size = 20)) + 
  labs(color = "Distribution",
       linetype = "Distribution",
       x = "SBP (mmHg)",
       y = "Density") +
  scale_color_manual(values = c("Factual" = "#468c8a", 
                                "Counterfactual" = "#FF9300")) +
  scale_linetype_manual(values = c("Factual" = "solid", 
                                   "Counterfactual" = "dashed"))

In the figure above, the solid teal density represents the empirical density of SBP for folks with 12 years of schooling in our data. The dashed tangerine density represents the SBP density we would expect had these individuals completed one additional year of schooling. As we would expect based on our point estimates from the different quantile regression models, the right tail of the distribution is pulled closer to the center in the counterfactual SBP distribution, whereas the left tail is relatively unaffected. We think this is an interesting way of visualizing the distributional “effect” of education on SBP among the largest educational attainment group in our sample.

But, as we said earlier, this figure makes some important assumptions which may or may not hold. First, we assume that the relationship of education with SBP is constant within unit quantile intervals. For instance, we assume that the relationship of education and SBP between the 10th and 11th quantiles is the same as the estimated relationship between education and SBP at the 10th quantile. This may be an acceptable assumption to make; certainly we don’t think it’s particularly devastating.

#Determining quantiles of the counterfactual SBP distribution
data_12years$quant_new <- xtile(data_12years$counterfactual_sbp, n = 99)

#Checking how many people change which quantile their SBP value falls into
data_12years$change <- data_12years$quant - data_12years$quant_new
table(data_12years$change) #No one changes quantiles 
## 
##    0 
## 2496

Another assumption that we appear to make is that of rank invariance. That is, individuals with 12 years of schooling maintain their rank in the SBP distribution even when their education increases by ~2.4 years. Notice how no one actually changes their quantile value per the table above. Those who work with quantile regressions usually find the rank invariance assumption difficult to justify. We too think that rank invariance is a strong assumption here: we should probably expect at least some changes in rank under different exposure levels due to randomness.

Post-estimation tests

Fitting so many CQR models raises the question: are our estimates at, say, the 10th quantile really different from our estimates at the 90th quantile? This is a question we’ve had in our group a number of times too. We use the anova test in the “quantreg” package to test this hypothesis. For example, say we wanted to test whether the 10th quantile estimate is statistically different from the 50th quantile estimate and whether that, in turn, is different from the 90th quantile estimate. We may want to know this in order to understand whether our data suggest that a location shift hypothesis best describes the exposure-outcome relationship or whether the exposure indeed reshapes the outcome distribution. We can do this test as follows:

#Fitting CQR at the 10th, 50th, and 90th quantiles
cqr_10 <- rq(sbp ~ schlyrs_c + age + age2 + gender + race + southern +
               mom_ed + dad_ed + year, data = data, tau = 0.1) 

cqr_50 <- rq(sbp ~ schlyrs_c + age + age2 + gender + race + southern +
               mom_ed + dad_ed + year, data = data, tau = 0.5) 

cqr_90 <- rq(sbp ~ schlyrs_c + age + age2 + gender + race + southern +
               mom_ed + dad_ed + year, data = data, tau = 0.9)

#Testing the difference in coefficients
anova(cqr_10, cqr_50)
## Quantile Regression Analysis of Deviance Table
## 
## Model: sbp ~ schlyrs_c + age + age2 + gender + race + southern + mom_ed + dad_ed + year
## Joint Test of Equality of Slopes: tau in {  0.1 0.5  }
## 
##   Df Resid Df F value  Pr(>F)    
## 1 15    17735  5.3067 8.8e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
anova(cqr_50, cqr_90)
## Quantile Regression Analysis of Deviance Table
## 
## Model: sbp ~ schlyrs_c + age + age2 + gender + race + southern + mom_ed + dad_ed + year
## Joint Test of Equality of Slopes: tau in {  0.5 0.9  }
## 
##   Df Resid Df F value    Pr(>F)    
## 1 15    17735  4.8036 2.016e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
anova(cqr_10, cqr_90)
## Quantile Regression Analysis of Deviance Table
## 
## Model: sbp ~ schlyrs_c + age + age2 + gender + race + southern + mom_ed + dad_ed + year
## Joint Test of Equality of Slopes: tau in {  0.1 0.9  }
## 
##   Df Resid Df F value    Pr(>F)    
## 1 15    17735   10.43 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
anova(cqr_10, cqr_50, cqr_90)
## Quantile Regression Analysis of Deviance Table
## 
## Model: sbp ~ schlyrs_c + age + age2 + gender + race + southern + mom_ed + dad_ed + year
## Joint Test of Equality of Slopes: tau in {  0.1 0.5 0.9  }
## 
##   Df Resid Df F value    Pr(>F)    
## 1 30    26595  5.6505 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The tests above compared the coefficient on the years of schooling variable across the models, along with the coefficients on all other covariates included in our models, and returned a p-value for the null hypothesis that these coefficients are jointly identical across quantiles. The test procedure is described in Koenker and Bassett (1982). The p-values above suggest that the coefficients are, in fact, statistically distinct from one another at the 5% level.

However, what we really wanted to test was whether the coefficient on years of schooling was identical across the three models and not whether all coefficients together were identical. We haven’t figured out a way to do this exactly, but one approach, which isn’t entirely satisfactory, is to fit the unadjusted CQR models and then compare them using “anova”.

#Fitting CQR at the 10th, 50th, and 90th quantiles, unadjusted models
un_cqr_10 <- rq(sbp ~ schlyrs_c, data = data, tau = 0.1) 

un_cqr_50 <- rq(sbp ~ schlyrs_c, data = data, tau = 0.5) 

un_cqr_90 <- rq(sbp ~ schlyrs_c, data = data, tau = 0.9)

#Testing the difference in coefficients
anova(un_cqr_10, un_cqr_50)
## Quantile Regression Analysis of Deviance Table
## 
## Model: sbp ~ schlyrs_c
## Joint Test of Equality of Slopes: tau in {  0.1 0.5  }
## 
##   Df Resid Df F value    Pr(>F)    
## 1  1    17749  12.098 0.0005061 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
anova(un_cqr_50, un_cqr_90)
## Quantile Regression Analysis of Deviance Table
## 
## Model: sbp ~ schlyrs_c
## Joint Test of Equality of Slopes: tau in {  0.5 0.9  }
## 
##   Df Resid Df F value    Pr(>F)    
## 1  1    17749  25.532 4.393e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
anova(un_cqr_10, un_cqr_90)
## Quantile Regression Analysis of Deviance Table
## 
## Model: sbp ~ schlyrs_c
## Joint Test of Equality of Slopes: tau in {  0.1 0.9  }
## 
##   Df Resid Df F value    Pr(>F)    
## 1  1    17749  40.807 1.722e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
anova(un_cqr_10, un_cqr_50, un_cqr_90)
## Quantile Regression Analysis of Deviance Table
## 
## Model: sbp ~ schlyrs_c
## Joint Test of Equality of Slopes: tau in {  0.1 0.5 0.9  }
## 
##   Df Resid Df F value   Pr(>F)    
## 1  2    26623   20.54 1.22e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
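One option that may get closer to a coefficient-specific comparison: the anova method in “quantreg” accepts a joint argument, and setting joint = FALSE requests coefficient-by-coefficient tests of slope equality across quantiles rather than a single joint test. A sketch, reusing the adjusted models fit above:

```r
#Coefficient-wise (rather than joint) tests of slope equality across quantiles;
#the output includes a separate test for each slope, including schlyrs_c
anova(cqr_10, cqr_50, cqr_90, joint = FALSE)
```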

CQR Function

Below, we also provide a function that loops through and fits conditional quantile regressions at the 10th, 25th, 50th, 75th, and 90th quantiles of the conditional outcome distribution. This function can be modified to fit all quantiles from the 1st through the 99th, or to fit select quantiles of your choosing.

restricted_cqr_func <- function(data){
  
  result <- data.frame()
  
  for(i in c(0.10, 0.25, 0.50, 0.75, 0.90)){
    
    i <- round(i, 2)
    
    set.seed(123)
    cqr_mod <- rq(sbp ~ schlyrs_c + age + age2 + gender + race + southern +
                    mom_ed + dad_ed + year, data = data, tau = i)
    
    
    #Bootstrap standard error
    cqr_boot <- summary.rq(cqr_mod, se = "boot", R = 50)
    cqr_schlyrs_est <- as.numeric(cqr_boot$coefficients["schlyrs_c", "Value"])
    cqr_schlyrs_sd <- as.numeric(cqr_boot$coefficients["schlyrs_c", 
                                                       "Std. Error"])
    
    cqr_row <- data.frame("Quantile" = i,
                          "Estimate" = cqr_schlyrs_est,
                          "Lower" = cqr_schlyrs_est - (1.96*cqr_schlyrs_sd),
                          "Upper" = cqr_schlyrs_est + (1.96*cqr_schlyrs_sd),
                          "regtype" = "CQR")
    
    result <- rbind(result, cqr_row)
    
  }
  
  return(result)
  
}


#Run CQR function for subset of quantiles
restricted_cqr_results <- restricted_cqr_func(data)
## Warning in rq.fit.br(x, y, tau = tau, ...): Solution may be nonunique

## Warning in rq.fit.br(x, y, tau = tau, ...): Solution may be nonunique

## Warning in rq.fit.br(x, y, tau = tau, ...): Solution may be nonunique
(The “Solution may be nonunique” warning is produced because of the categorical variables in the model.)

#Compare to all results
# restricted_cqr_results

# cqr_results %>%
#   dplyr::filter(Quantile == 0.10 | Quantile == 0.25 | Quantile == 0.50 | 
#                   Quantile == 0.75 | Quantile == 0.90)

Key Takeaways

  1. Just as linear regression models the CEF, CQR models the CQF
     - The CQF can be defined for any quantile of interest
  2. CQR coefficients can be estimated by minimizing the sum of \(\rho_\tau(.)\) which does not have an analytic solution
     - Coefficients can be interpreted as the change in the conditional quantile of interest for a unit change in the exposure
  3. CQR standard errors are inversely proportional to the error density; as a result, standard errors are larger in parts of the outcome distribution where outcome data are sparse
     - The bootstrap is the preferred method of estimating the standard errors

Unconditional Quantile Regression (UQR)

Learning Aims

  1. Counterfactual unconditional distributions and contrasts
  2. Firpo, Fortin, and Lemieux (2009) method of estimating parameters of the unconditional quantile regression
  3. Interpreting unconditional quantile regression results

Counterfactual Distributions

What would we do if we wanted to learn how the unconditional outcome distribution changed in response to some population-level intervention? How can we contrast quantiles of the factual and counterfactual distributions?

Borah and Basu (2013) suggest that CQR can quantify these contrasts only under some stringent assumptions: 1) the outcome is only a function of the exposure, or 2) the exposure only induces a location shift in the outcome in the presence of other covariates. If an exposure interacts with other covariates in the data generating process – which is quite likely given how complex our world is – then CQR cannot estimate contrasts of the unconditional outcome distribution.

Several methods have been developed to estimate contrasts between the factual and counterfactual unconditional outcome distributions. For this workshop, we will focus on a method developed by Sergio Firpo, Nicole Fortin, and Thomas Lemieux (2009) (2-3). We focus on this method because it is a regression-based method (and so compares nicely with CQR and linear regression), is relatively easy to implement, and is quite popular in the economics literature.

Firpo’s unconditional quantile regression method quantifies how the \(\tau^{th}\) quantile of the unconditional outcome distribution changes for a small change in the exposure distribution.

At the heart of Firpo’s method is the recentered influence function (RIF). The RIF is the sum of a distributional statistic’s influence function and the distributional statistic itself. Let’s break this down a bit.

Distributional statistics can be anything from means to quantiles to Gini coefficients. Influence functions are a way of assessing the “robustness” of distributional statistics; they allow us to compute by how much, say, the mean of a distribution would change if we made a “small” change or addition to the existing distribution. Each distributional statistic has an influence function specific to it, and the influence function for the \(\tau^{th}\) quantile of Y is \(IF(y; q_{\tau}) = \frac{\tau - I(y \leq q_{\tau})}{f_y(q_{\tau})}\), where \(q_{\tau}\) is the value of a random variable \(Y\) at the \(\tau^{th}\) quantile, \(I(y \leq q_{\tau})\) is an indicator function for an individual falling above or below \(q_{\tau}\), and \(f_y(q_{\tau})\) is the density of Y at the \(\tau^{th}\) quantile. Firpo then defines the RIF for the \(\tau^{th}\) quantile as \(RIF(y;q_{\tau}) = q_{\tau} + IF(y; q_{\tau}) = q_{\tau} + \frac{\tau - I(y \leq q_{\tau})}{f_y(q_{\tau})}\). A nice property of the RIF is that its mean equals the distributional statistic of interest. Note that in the case of quantiles: \(E[RIF(y; q_{\tau})] = E[q_{\tau} + \frac{\tau - I(y \leq q_{\tau})}{f_y(q_{\tau})}] = E[q_{\tau}] + E[\frac{\tau - I(y \leq q_{\tau})}{f_y(q_{\tau})}] = q_\tau\) because \(E[\frac{\tau - I(y \leq q_{\tau})}{f_y(q_{\tau})}]=0\) (since the numerator of this equation is \(\tau - \tau\) in expectation).
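To make the algebra concrete, here is a minimal R sketch (using simulated data rather than our study sample) that computes the RIF at the 25th quantile by hand and checks that its mean recovers the quantile; because \(f_y(q_{\tau})\) is estimated with a kernel density, the check is approximate:

```r
set.seed(123)
y <- rnorm(5000, mean = 130, sd = 20) #Simulated SBP-like values
tau <- 0.25
q_tau <- quantile(y, tau)

#Kernel density estimate of f_y evaluated at q_tau
dens <- density(y)
f_q <- approx(dens$x, dens$y, xout = q_tau)$y

#RIF(y; q_tau) = q_tau + (tau - I(y <= q_tau)) / f_y(q_tau)
rif_vals <- q_tau + (tau - (y <= q_tau)) / f_q

mean(rif_vals) #Approximately equal to q_tau
```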

Firpo then proposes using the RIF as the outcome in a regression model to estimate the change in the \(\tau^{th}\) quantile of the unconditional outcome distribution for a small change in the unconditional distribution of the exposure. They call this modeling strategy the RIF-regression and propose several ways of estimating the RIF-regression, one of which is OLS.

The Firpo method has three main steps:

  1. Estimate the RIF in the unconditional outcome distribution
  2. Regress the RIF on the exposure and covariates
  3. Estimate standard errors using bootstrapping or the sandwich estimator

First, we will use the “dineq” package to estimate the recentered influence function (RIF) at certain quantiles of the unconditional outcome distribution. Note that RIF values can be computed for any quantile in (0, 1).

Finding the RIF

#install.packages("dineq")
library(dineq) #RIF package

#RIF at q25
data$rif_q25 <- rif(data$sbp, weights = NULL, method = "quantile", 
                    quantile = 0.25)
data$rif_q25_r <- round(data$rif_q25, 2)

rif25 <- ggplot(aes(x = factor(rif_q25_r)), data = data) +
  geom_bar(width = 0.2, fill = "#F2494C") + 
  labs(x = "RIF Value",
       y = "Count",
       title = "25th Quantile") +
  theme(panel.background = element_rect(fill = 'white', colour = 'white'),
        panel.grid.major = element_blank(),
        panel.grid.minor = element_blank(),
        legend.position = "bottom",
        strip.background = element_blank(),
        text = element_text(size = 15))


#RIF at q50
data$rif_q50 <- rif(data$sbp, weights = NULL, method = "quantile", 
                    quantile = 0.50)
data$rif_q50_r <- round(data$rif_q50, 2)

rif50 <- ggplot(aes(x = factor(rif_q50_r)), data = data) +
  geom_bar(width = 0.2, fill = "#468C8A") + 
  labs(x = "RIF Value",
       y = "Count",
       title = "50th Quantile") +
  theme(panel.background = element_rect(fill = 'white', colour = 'white'),
        panel.grid.major = element_blank(),
        panel.grid.minor = element_blank(),
        legend.position = "bottom",
        strip.background = element_blank(),
        text = element_text(size = 15))


#RIF at q75
data$rif_q75 <- rif(data$sbp, weights = NULL, method = "quantile", 
                    quantile = 0.75)
data$rif_q75_r <- round(data$rif_q75, 2)

rif75 <- ggplot(aes(x = factor(rif_q75_r)), data = data) +
  geom_bar(width = 0.2, fill = "#FF9300") + 
  labs(x = "RIF Value",
       y = "Count",
       title = "75th Quantile") +
  theme(panel.background = element_rect(fill = 'white', colour = 'white'),
        panel.grid.major = element_blank(),
        panel.grid.minor = element_blank(),
        legend.position = "bottom",
        strip.background = element_blank(),
        text = element_text(size = 15))


#Combine
rif_vals <- data %>%
  ungroup() %>%
  dplyr::select(id, ends_with("_r")) %>%
  rename("q25" = "rif_q25_r",
         "q50" = "rif_q50_r",
         "q75" = "rif_q75_r") %>%
  pivot_longer(!id, names_to = "Quantile", values_to = "RIF_Values")

rif_comb <- ggplot(aes(x = RIF_Values, fill = factor(Quantile)),
                   data = rif_vals) +
  geom_bar(width = 5) +
  labs(x = "RIF Value",
       y = "Count",
       fill = "Quantile",
       title = "25th, 50th, and 75th Quantiles") +
  scale_fill_manual(values = c("#F2494C", "#468C8A", "#FF9300")) + 
  scale_x_continuous(breaks = seq(25, 300, by = 25)) +
  theme(panel.background = element_rect(fill = 'white', colour = 'white'),
        panel.grid.major = element_blank(),
        panel.grid.minor = element_blank(),
        legend.position = "bottom",
        strip.background = element_blank(),
        text = element_text(size = 15))


grid.arrange(rif25, rif50, rif75, rif_comb, ncol = 3)

Using the 25th quantile as an example, we will show how the expected value of the recentered influence function equals the \(\tau^{th}\) quantile.

\[RIF(SBP_i;q_{0.25}) = q_{0.25} + \frac{\tau - I(SBP_i \leq q_{0.25})}{f_y(q_{0.25})} = 114 + \frac{0.25 - I(SBP_i \leq 114)}{f_y(114)} = \begin{cases} 75.94, & SBP_i \leq 114 \\ 126.69, & SBP_i > 114 \end{cases}\]

At the 25th quantile, the recentered influence function takes on the value 75.94 for the 25% of participants who fall at or below the threshold (i.e., \(q_{0.25}\) = 114 mmHg) and 126.69 for the 75% of participants who fall above it. The weighted average of these values (i.e., the mean value of the RIF) equals 114 mmHg, which is the 25th quantile of the unconditional distribution of SBP in our data.

\(E[RIF(SBP_i;q_{0.25})] = (0.25)(75.94) + (0.75)(126.69) = 114 = q_{\tau}\)
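This property is easy to check in R with the RIF values we created above: the mean of the rif_q25 variable should (approximately) return the 25th quantile of SBP.

```r
#The mean of the RIF equals the distributional statistic it was built from
mean(data$rif_q25)       #~114 mmHg, the 25th quantile of SBP in our data
quantile(data$sbp, 0.25)
```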

Estimating and Interpreting Coefficients

Now that we have estimated the recentered influence function in the unconditional outcome distribution, we regress the RIF on the exposure and covariates. There are three ways of regressing the RIF on the exposure and covariates:

  1. Linear regression using ordinary least squares (OLS) with RIF as the outcome (RIF-OLS)
  2. Logistic regression using \(I(y > q_\tau)\) as the outcome (RIF-Logit)
  3. Polynomial regression to model \(I(y > q_\tau)\) (RIF-NP)

We will focus on RIF-OLS since it is the easiest to implement and the most practical. RIF-Logit and RIF-NP instead rearrange the recentered influence function to isolate the \(I(y > q_\tau)\) term:

\(RIF(y;q_{\tau}) = q_{\tau} + \frac{\tau - I(y \leq q_{\tau})}{f_y(q_{\tau})} = \frac{1}{f_y(q_{\tau})}I(y > q_\tau) + q_\tau - \frac{1-\tau}{f_y(q_{\tau})}\)

We will go over an example of fitting the RIF-OLS regression for SBP at the 25th quantile of the unconditional distribution. Specifically, we will fit the equation:

\(E[RIF(SBP_i;q_{0.25})|X,C] = \alpha_{0, 0.25} + \alpha_{1, 0.25}x_i + \lambda_{0.25}'C\)

Where \(\lambda_{0.25}\) is a vector of coefficients for the confounders.

If we take the expectation of the expression above, we apply iterated expectations on the left hand side of the equation and distribute the expectation on the right hand side, leaving us with

\(E[E[RIF(SBP_i;q_{0.25})|X,C]] = E[\alpha_{0, 0.25} + \alpha_{1, 0.25}x_i + \lambda_{0.25}'C]\)

\(E[RIF(SBP_i;q_{0.25})] = \alpha_{0, 0.25} + \alpha_{1, 0.25}E[X] + \lambda_{0.25}'E[C]\)

Where the left hand side of the equation is the expectation of the RIF (i.e., \(q_{0.25}\)), \(\alpha_{0, 0.25}\) is the intercept of the sample regression line and \(\alpha_{1, 0.25}\) is the slope of the sample regression line.

Firpo’s method “tricks” a regression model into modeling quantiles of the unconditional outcome distribution by using the recentered influence function as the outcome

Coefficients of the RIF-OLS model are interpreted differently than traditional linear regression coefficients. Since \(E[\widehat{RIF(SBP_i;q_{\tau})}] = \widehat{\alpha_{0, \tau}} + \widehat{\alpha_{1, \tau}}E[X] + \widehat{\lambda_{\tau}'}E[C]\), \(\widehat{\alpha_{1, \tau}}\) is the estimated change in the \(\tau^{th}\) quantile of the unconditional distribution of \(Y\) for a unit change in the mean of the unconditional distribution of the exposure \(X\).

RIF-OLS estimates the change in the \(\tau^{th}\) quantile of the outcome’s unconditional distribution for a unit change in the mean of the unconditional distribution of the exposure

In the context of our example, unconditional quantile regression answers the question: By how much does the \(\tau^{th}\) quantile of the unconditional SBP distribution change if the mean of the unconditional education distribution changed by one unit?

Estimating Standard Errors

There are two approaches to estimating standard errors for RIF regressions. The first – and preferred – method is the bootstrap. When bootstrapping, you resample with replacement 500+ times, fit the RIF-OLS regression in each resample, and then use the 2.5th and 97.5th quantiles of the resulting distribution of coefficient estimates as the bounds of a 95% confidence interval.

The second method is to estimate heteroskedasticity-robust standard errors. These can be estimated the same way as for linear regression, using the sandwich formula \(\widehat{Var}(\hat{\beta})=(X'X)^{-1}X' \Omega X(X'X)^{-1}\), where standard errors are the square roots of the diagonal elements and \(\Omega\) is the error variance-covariance matrix:

\[ \Omega = \left[\begin{array}{cccc} \sigma_1^2 & 0 & \cdots & 0 \\ 0 & \sigma_2^2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \sigma_n^2 \end{array}\right] = \left[\begin{array}{cccc} \epsilon_1^2 & 0 & \cdots & 0 \\ 0 & \epsilon_2^2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \epsilon_n^2 \end{array}\right] = \left[\begin{array}{cccc} (y_1-\hat{y}_1)^2 & 0 & \cdots & 0 \\ 0 & (y_2-\hat{y}_2)^2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & (y_n-\hat{y}_n)^2 \end{array}\right] \]

UQR in R

We will use the “dineq” package to run our unconditional quantile regressions using Firpo’s recentered influence function method (RIF-OLS). We will only show bootstrapped confidence intervals since bootstrapping is the preferred method for standard error estimation.

RIF-OLS with Bootstrapped Confidence Intervals

data$rif25 <- rif(data$sbp, weights = NULL, method = "quantile",
                  quantile = 0.25)

uqr_25 <- lm(rif25 ~ schlyrs_c + age + age2 + gender + race + southern +
               mom_ed + dad_ed + year, data = data)

#Bootstrap standard error (lm.boot() comes from the "simpleboot" package)
uqr_25_boot <- lm.boot(uqr_25, R = 50)
boot_sum_25 <- summary(uqr_25_boot)


adj_uqr_25 <- c("Estimate" = boot_sum_25$orig.lm$coefficients["schlyrs_c"],
                "Bootstrapped Std. Error" = boot_sum_25$stdev.params["schlyrs_c"])
adj_uqr_25
##                Estimate.schlyrs_c Bootstrapped Std. Error.schlyrs_c 
##                        -0.4941319                         0.1165962
summary(uqr_25)$coefficients["schlyrs_c", ] #Unadjusted
##      Estimate    Std. Error       t value      Pr(>|t|) 
## -4.941319e-01  1.039239e-01 -4.754749e+00  2.018277e-06

Notice that the model-based (non-bootstrapped) “schlyrs_c” standard error is slightly different from the bootstrapped “schlyrs_c” standard error.
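If you instead want the second approach described earlier (heteroskedasticity-robust standard errors), one way is via the “sandwich” and “lmtest” packages; a sketch, assuming the uqr_25 model from above:

```r
library(sandwich) #Robust variance-covariance estimators
library(lmtest)   #coeftest() for coefficient tables

#Heteroskedasticity-robust standard error for the schlyrs_c coefficient
coeftest(uqr_25, vcov = vcovHC(uqr_25, type = "HC1"))["schlyrs_c", ]
```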

UQR Function

Below, we also provide a function that loops through and fits unconditional quantile regressions at the 1st to 99th quantiles of the unconditional outcome distribution.

uqr_func <- function(data){
  
  result <- data.frame()
  
  for(i in seq(0.01, 0.99, by = 0.01)){
    
    i <- round(i, 2)
    data$rif <- rif(data$sbp, weights = NULL, method = "quantile",
                    quantile = i)
    
    set.seed(123)
    uqr_mod <- lm(rif ~ schlyrs_c + age + age2 + gender + race + southern +
                    mom_ed + dad_ed + year, data = data)
    
    
    #Bootstrap standard error
    uqr_boot <- lm.boot(uqr_mod, R = 500)
    boot_sum <- summary(uqr_boot)
    
    uqr_schlyrs_est <- as.numeric(boot_sum$orig.lm$coefficients["schlyrs_c"])
    uqr_schlyrs_sd <- as.numeric(boot_sum$stdev.params["schlyrs_c"])
    
    uqr_row <- data.frame("Quantile" = i,
                          "Estimate" = uqr_schlyrs_est,
                          "Lower" = uqr_schlyrs_est - (1.96*uqr_schlyrs_sd),
                          "Upper" = uqr_schlyrs_est + (1.96*uqr_schlyrs_sd),
                          "regtype" = "UQR")
    
    result <- rbind(result, uqr_row)
    
  }
  
  return(result)
  
}

uqr_results <- uqr_func(data)


#Figure
ggplot(data = uqr_results, aes(x = Quantile, y = Estimate)) +
  geom_line(color = "#FF9300", linewidth = 1) +
  geom_ribbon(aes(x = Quantile, ymin = Lower, ymax = Upper), 
              alpha = 0.45, fill = "#FF9300") +
  geom_hline(yintercept = 0, alpha = 0.5) +
  theme(panel.background = element_rect(fill = 'white', colour = 'white'),
        panel.grid.major = element_blank(),
        panel.grid.minor = element_blank()) +
  labs(x = xaxis,
       y = yaxis,
       title = "Association of educational attainment with systolic blood pressure") +
  scale_x_continuous(breaks = c(0.01, seq(0.10, 0.90, 0.10), 0.99),
                     labels = c("q1", "q10", "q20", "q30", "q40", "q50", 
                                "q60", "q70", "q80", "q90", "q99"))

We can also add our OLS estimate to the UQR plot so viewers can easily compare estimates at the mean to estimates at each quantile of the unconditional distribution.

ols_uqr <- rbind(ols_results, uqr_results)

#Figure
ggplot(data = ols_uqr %>% filter(regtype != "OLS"),
       aes(x = Quantile, y = Estimate, group = regtype,
           color = regtype, fill = regtype)) +
  geom_line(linewidth = 0.75) +
  geom_ribbon(aes(ymin = Lower, ymax = Upper), alpha = 0.27, color = NA) +
  geom_point(data = ols_uqr %>% filter(regtype == "OLS")) +
  geom_errorbar(data = ols_uqr %>% filter(regtype == "OLS"),
                aes(ymin = Lower, ymax = Upper), width = 0.01) +
  geom_hline(yintercept = 0, alpha = 0.5) +
  geom_vline(xintercept = -0.04, color = "gray", linetype = "dashed") +
  theme(panel.background = element_rect(fill = 'white', colour = 'white'),
        panel.grid.major = element_blank(),
        panel.grid.minor = element_blank(),
        legend.position = "bottom",
        text = element_text(size = 10))  +
  scale_color_manual(breaks = c("OLS", "UQR"),
                     labels = c("OLS", "UQR"),
                     values = c("#F2494C", "#FF9300")) +
  scale_fill_manual(breaks = c("OLS", "UQR"),
                    labels = c("OLS", "UQR"),
                    values = c("#F2494C", "#FF9300"))  +
  scale_x_continuous(limits = c(-0.11, 1),
                     breaks = c(-0.1, 0.01, seq(0.10, 0.90, 0.10), 0.99),
                     labels = c("OLS", "q1", "q10", "q20", "q30", "q40", "q50", 
                                "q60", "q70", "q80", "q90", "q99")) +
  labs(x = "",
       y = yaxis,
       color = "Model", group = "Model", fill = "Model")  +
  theme(legend.position = "bottom",
        legend.box.spacing = unit(0, "pt"),
        legend.margin = margin(-10,0,0,0),
        text = element_text(size = 15))

UQR Counterfactual Distribution

Just like with CQR, we can use model estimates from UQR to create counterfactual distributions of SBP. We shift the observed SBP values by our estimate at each quantile and plot what the new counterfactual distribution could look like.

#Creates dataset for plotting
counter_data <- function(results, data){
  
  coef <- results %>%
    dplyr::filter(regtype == "UQR") %>%
    dplyr::select(Quantile, Estimate) %>%
    rename("quant" = "Quantile",
           "coef" = "Estimate")
  coef$quant <- round(coef$quant * 100, 1) #Need to round for merging
  
  red <- data %>%
    dplyr::select(sbp) %>%
    mutate(quant = xtile(sbp, n = 99))
  
  merg <- right_join(coef, red, by = "quant")
  
  cf <- merg %>%
    mutate(sbp = sbp + coef,
           group = "Counterfactual") %>%
    select(group, sbp)
  
  f <- merg %>%
    mutate(group = "Factual") %>%
    select(group, sbp)
  
  cf_plot <- rbind(f, cf)
  cf_plot$group <- relevel(factor(cf_plot$group), ref = "Factual")
  
  return(cf_plot)
  
}

#Create counterfactual data
uqr_counter <- counter_data(uqr_results, data)


#Create graphic
ggplot(data = uqr_counter, aes(x = sbp, color = group, linetype = group)) +
  geom_density(aes(y = after_stat(density)), linewidth = 1) +
  theme(panel.background = element_rect(fill = 'white', colour = 'white'),
        panel.grid.major = element_blank(),
        panel.grid.minor = element_blank(),
        legend.position = "bottom",
        strip.background = element_blank(),
        text = element_text(size = 20)) + 
  labs(color = "Distribution",
       linetype = "Distribution",
       x = "SBP (mmHg)",
       y = "Density") + 
  scale_color_manual(values = c("Factual" = "#468C8A", 
                                "Counterfactual" = "#FF9300")) +
  scale_linetype_manual(values = c("Factual" = "solid", 
                                   "Counterfactual" = "dashed"))

Key Takeaways

  1. Estimating the change in the unconditional quantile using conditional quantile regression estimates is difficult
     - Alternative methods like Firpo’s, which estimate the unconditional quantile directly, are much easier to use
  2. The RIF-regression method captures the change in quantiles of the unconditional distribution for a small change in the exposure distribution
     - Since it includes an indicator function, the RIF takes on only two values at a given quantile
  3. Recentered influence function (RIF) regression involves regressing the RIF on the exposure and other covariates (e.g., using OLS)
     - We are “tricking” a regression into modeling unconditional quantiles by using the RIF as the outcome

Summarizing Linear Regression, CQR, and UQR

Comparing Estimation Strategies and Interpretations

We provide a quick comparison of the differences between linear regression, conditional quantile regression, and unconditional quantile regression.

Model
  - Linear regression: \(E[Y|X] = X'\beta\)
  - CQR: \(Q_\tau(Y|X)=X'\beta_\tau\), where \(\tau \in (0,1)\)
  - UQR: \(E[RIF(Y;q_\tau)|X]=X'\beta_\tau\), where \(\tau \in (0,1)\)

Error
  - Linear regression: standard assumptions about the error term (iid, normally distributed) are usually violated
  - CQR: no distributional assumption is made about the error
  - UQR: no explicit assumption, but likely the same assumptions as the estimation strategy used

Estimating coefficients
  - Linear regression: \(\hat{\beta}=(X'X)^{-1}X'Y\) (analytic solution)
  - CQR: \(\min_{\hat{\beta}}\sum_{i=1}^{N} \rho_\tau(y_i - x_i'\hat{\beta})\) (solved with linear programming methods)
  - UQR: depends on the estimator used (e.g., OLS for RIF-OLS)

Estimating standard errors
  - Linear regression: several estimators are available, plus the bootstrap
  - CQR: several estimators are available (the key point is that the error density must be estimated), plus the bootstrap
  - UQR: the same estimators available for the chosen estimation strategy

Interpretation
  - Linear regression: a one unit increase in the independent variable of interest is associated with a \(\hat{\beta}\) unit change in the conditional mean of the outcome
  - CQR: a one unit increase in the independent variable of interest is associated with a \(\hat{\beta}_\tau\) unit change in the \(\tau^{th}\) quantile of the conditional outcome distribution
  - UQR: a one unit increase in the mean of the independent variable of interest is associated with a \(\hat{\beta}_\tau\) unit change in the \(\tau^{th}\) quantile of the unconditional outcome distribution

Comparing Model Estimations

It’s important to note that confidence intervals can get quite wide at the tails of the distribution, due to low density. Because of this, some researchers present only the 10th through the 90th quantiles, but we prefer to present results from the 1st through the 99th quantiles. Finally, remember that

Even if CQR and UQR estimates appear similar, they are estimating different estimands and cannot be directly compared!

Contact Information

We’d love to hear feedback or answer any questions you still have! Please feel free to reach out to us through email at any time.

Jilly Hebert
Department of Family and Community Medicine, University of California, San Francisco

Amanda Irish
Department of Family and Community Medicine, University of California, San Francisco

Anusha Vable
Department of Family and Community Medicine, University of California, San Francisco

References

  1. Chernozhukov V, Fernández-Val I, Melly B. Inference on Counterfactual Distributions. Econometrica. 2013;81(6):2205–68.

  2. Firpo S, Fortin NM, Lemieux T. Unconditional Quantile Regressions. Econometrica. 2009;77(3):953–73.

  3. Firpo S, Pinto C. Identification and Estimation of Distributional Impacts of Interventions Using Changes in Inequality Measures. J Appl Econom. 2016;31(3):457–86.

  4. Koenker R, Bassett G. Regression Quantiles. Econometrica. 1978;46(1):33–50.

  5. Koenker R, Chernozhukov V, He X, Peng L, editors. Handbook of Quantile Regression [Internet]. 1st Edition. New York: Chapman & Hall/CRC; 2017 [cited 2022 Sep 7]. 483 p. Available from: https://www.routledge.com/Handbook-of-Quantile-Regression/Koenker-Chernozhukov-He-Peng/p/book/9780367657574

  6. Koenker R. Quantile Regression: 40 Years On. Annu Rev Econ. 2017;9(1):155–76.

  7. Vable AM, Eng CW, Mayeda ER, Basu S, Marden JR, Hamad R, et al. Mother’s education and late-life disparities in memory and dementia risk among US military veterans and non-veterans. J Epidemiol Community Health. 2018 Dec;72(12):1162–7.

  8. Wu Q, Tchetgen Tchetgen EJ, Osypuk TL, White K, Mujahid M, Maria Glymour M. Combining direct and proxy assessments to reduce attrition bias in a longitudinal study. Alzheimer Dis Assoc Disord. 2013 Sep;27(3):207–12.